Re: Lucene crawler plan

Peter Becker Tue, 01 Jul 2003 15:35:13 -0700

Erik Hatcher wrote:

On Monday, June 30, 2003, at 10:21 PM, Peter Becker wrote:

This is far closer to what we are looking for. Using Ant is an interesting idea, although it probably won't help us for the UI tool. But we could try to layer things so we could use them for both.

Yes, I'm sure a more generalized method could be developed that accommodates both. It's pretty decoupled even within the Ant project, with a DocumentHandler interface and all.

And frankly -- these little code pieces are easy to port. The trick is knowing which library to use and how.

Two differences between the Ant project and what we do right now:

- the Ant project doesn't have a notion of an explicit file filter. I think this is important if you want to extend the filter options to more than just extensions and if you want some UI to manage the filter mappings. BTW: does anyone know of a Java implementation for file(1) magic?

Ah, but Ant *does* have more sophisticated filtering mechanisms! :) The <fileset>s that the <index> task accepts can leverage any of Ant's built-in capabilities, such as the Selector capability (new in Ant 1.5). So you could easily filter on file size, file date, etc., and custom Selectors can be written and plugged in.

Ant does. What I meant by "the Ant project" was the code in the Lucene CVS for Ant. The decision between the two DocumentHandlers seems to be made based on the extension. But maybe I didn't read the code properly.

What I want to see is a user-defined mapping from some kinds of FileFilters (extension, wildcard, regexp, magic numbers, whatever) to the DocumentHandlers. They should be applied in order, and as soon as one matches, the iteration stops, unless an exception gets thrown by the DocumentHandler. Additional DocumentHandlers could be mixed in to provide extra information. I am thinking of file system information and metadata stores here. These would be an independent dimension of data about the documents.
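A minimal sketch of such an ordered mapping, to make the idea concrete. It uses the standard java.io.FileFilter; ContentHandler and HandlerChain are hypothetical names invented for this illustration and are not the DocumentHandler interface from the Lucene CVS. Treating a handler exception as "try the next matching rule" is one reading of the rule above, not a settled design.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document handler (assumption, not Lucene's interface).
interface ContentHandler {
    Map<String, String> extract(File file) throws Exception;
}

// Ordered, user-defined mapping from file filters to handlers.
class HandlerChain {
    private static final class Rule {
        final FileFilter filter;
        final ContentHandler handler;

        Rule(FileFilter filter, ContentHandler handler) {
            this.filter = filter;
            this.handler = handler;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final List<ContentHandler> mixins = new ArrayList<>();

    // Rules are tried in the order in which they were added.
    HandlerChain map(FileFilter filter, ContentHandler handler) {
        rules.add(new Rule(filter, handler));
        return this;
    }

    // Mix-ins add independent data (file system info, metadata stores).
    HandlerChain mixin(ContentHandler handler) {
        mixins.add(handler);
        return this;
    }

    // Returns the collected fields, or null if no rule handled the file.
    Map<String, String> handle(File file) {
        Map<String, String> fields = null;
        for (Rule rule : rules) {
            if (!rule.filter.accept(file)) {
                continue;
            }
            try {
                fields = new LinkedHashMap<>(rule.handler.extract(file));
                break; // first successful hit stops the iteration
            } catch (Exception e) {
                // handler failed: fall through to the next matching rule
            }
        }
        if (fields == null) {
            return null;
        }
        for (ContentHandler mixin : mixins) {
            try {
                fields.putAll(mixin.extract(file));
            } catch (Exception e) {
                // mix-in data is treated as optional in this sketch
            }
        }
        return fields;
    }
}
```

A UI could then manage the rule list directly, since each rule is just a (filter, handler) pair and the order of the list is the order of evaluation.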

- the code creates Documents as return values. The reason we went away from this is that we want to use the same document handler with different index options. One of the core issues here is storing the body or not. I don't think there is any true answer for this one, so it should be configurable somehow.

Agreed. It was a toss-up when I went to implement it as to who is actually in control of the Document instantiation and population.

The two options I see are either returning a data object and then turning that into a Document somewhere else, or passing some configuration object around. Neither is really nice: the first one needs to create an additional object all the time, while the second one puts quite some burden on the implementer of the document handler. Ideas on that one would be extremely welcome.

If you invert what I have done, then the "controller" needs to know more about the fields than you could convey in a String/String Map: is a field indexed or not? Is a field tokenized or not? Is it stored or not? Who decides on the field names? Who decides all of this? These are the questions we have to answer to do this type of thing.

Exactly. Somehow these issues should be separated from the issue of finding the data. Our current idea is to collect everything in a data object and then get some other code to turn it into a Lucene Document. Another version would be a wrapper/factory/strategy around the Lucene Document doing the mapping.
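A rough sketch of the data-object variant, assuming the handler only collects name/value pairs and a separate mapping answers the stored/indexed/tokenized questions. All class names here (ExtractedData, FieldPolicy, IndexField, IndexMapping) are hypothetical; IndexField merely stands in for a Lucene Field, and the final conversion into a Lucene Document is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical data object filled by a handler; it knows nothing about indexing.
class ExtractedData {
    final Map<String, String> values = new LinkedHashMap<>();

    ExtractedData put(String name, String value) {
        values.put(name, value);
        return this;
    }
}

// How one field should be treated: stored? indexed? tokenized?
class FieldPolicy {
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldPolicy(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }
}

// A value paired with its policy; a stand-in for a Lucene Field (assumption).
class IndexField {
    final String name;
    final String value;
    final FieldPolicy policy;

    IndexField(String name, String value, FieldPolicy policy) {
        this.name = name;
        this.value = value;
        this.policy = policy;
    }
}

// Strategy that decides the index options, independent of the handler.
class IndexMapping {
    private final Map<String, FieldPolicy> policies = new LinkedHashMap<>();
    private final FieldPolicy fallback;

    IndexMapping(FieldPolicy fallback) {
        this.fallback = fallback;
    }

    IndexMapping set(String fieldName, FieldPolicy policy) {
        policies.put(fieldName, policy);
        return this;
    }

    // Fields without an explicit policy get the fallback policy.
    List<IndexField> toFields(ExtractedData data) {
        List<IndexField> out = new ArrayList<>();
        for (Map.Entry<String, String> e : data.values.entrySet()) {
            FieldPolicy policy = policies.getOrDefault(e.getKey(), fallback);
            out.add(new IndexField(e.getKey(), e.getValue(), policy));
        }
        return out;
    }
}
```

With this split, the same handler output can feed one index that stores the body and another that only indexes it, simply by swapping the IndexMapping. The cost is the extra object per document mentioned above.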

The field name question would be separated this way, but one question would be left: what are the fields? The idea of having the extra Properties field doesn't really help that much, since then we are back to where we started. Giving a big range of default fields (along the lines of Dublin Core?) would help, but would be quite some overkill. It could be expensive in terms of object creation, too -- the wrapper approach would probably be better here.

Two ideas we will probably pick up from this are:

- use Ant for creating indexes if we go larger than personal document retrieval

Keep in mind you could also launch Ant via the API from a GUI, or just leverage the IndexTask itself and call it via the API and its execute() method.

I'll investigate this. Thanks.

- use JTidy for HTML parsing (we missed that one and used Swing instead, which is no good)

I think there are probably some better options out there than using JTidy these days, but I have not had time to investigate them. JTidy does the job reasonably well though.

We are looking into some alternatives. We have a few tens of thousands of documents to test on :-) I suspect we will just implement whatever comes along and let them run, collecting exceptions and time eaten. Checking whether they really got all the interesting content will be too much work, though.
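Such a trial run could look roughly like the following. HtmlTextExtractor and ParserTrial are hypothetical names; a real run would wrap JTidy or one of the alternatives behind the interface, and this sketch measures wall-clock time only.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over an HTML parser under test (assumption).
interface HtmlTextExtractor {
    String extract(File file) throws Exception;
}

// Runs one parser over a set of files, collecting failures and time eaten.
class ParserTrial {
    int failures;
    long nanos;
    final Map<String, Integer> failuresByType = new LinkedHashMap<>();

    static ParserTrial run(HtmlTextExtractor parser, List<File> files) {
        ParserTrial trial = new ParserTrial();
        long start = System.nanoTime();
        for (File file : files) {
            try {
                parser.extract(file);
            } catch (Exception e) {
                // count the failure and keep going with the next file
                trial.failures++;
                trial.failuresByType.merge(e.getClass().getSimpleName(), 1, Integer::sum);
            }
        }
        trial.nanos = System.nanoTime() - start;
        return trial;
    }
}
```

Running the same file list through each candidate gives comparable failure counts and timings, without having to judge the quality of the extracted content.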

What are the issues with JTidy?

Peter

