Re: [jira] Commented: (LUCENE-778) Allow overriding a Document

Nicolas Lalevée Sat, 20 Jan 2007 04:54:32 -0800

Very interesting discussion. It matches some ideas I had about how Lucene 
works; I just wasn't sure of their relevance, having only hacked on Lucene 
for a few months.

I love the idea of decoupling the document being indexed from the document 
being extracted from the index. It also ties in with a comment in the code of 
IndexReader:
  //When we convert to JDK 1.5 make this Set<String>
  public abstract Document document(int n, FieldSelector fieldSelector) throws IOException;
Here it shouldn't be Set<String>, but a sort of Set<Fieldable>. But the idea 
is there.

The idea behind LUCENE-778 was simply to allow custom document indexing, much 
like Grant's idea of a DBFieldable.

Then there is the document extraction part. I have done some work on it in 
LUCENE-662. Some ideas in this thread suggested making the default 
instantiation of Document configurable through a system property. I don't 
like this idea because it doesn't move us toward typing classes the generic 
Java way. It will work, but I think we can do better.
The basic idea is to provide a document factory, which can be parameterized: a 
sort of DocumentFactory<ResultDocument>. This factory is then used by the 
FieldReader<ResultDocument>, which provides field-filled instances of 
ResultDocument. And finally the IndexReader<ResultDocument> will provide 
ResultDocument instances.
From the point of view of a Lucene user, this would be fantastic: instantiate 
an IndexReader<MyAppDocument>, and then get MyAppDocument instances without 
any cast.
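
To make the factory idea concrete, here is a minimal sketch in plain Java. It 
does not use Lucene at all: DocumentFactory, TypedReader and the in-memory 
map of stored fields are hypothetical stand-ins invented for illustration 
(TypedReader plays the role of the proposed IndexReader<ResultDocument>), and 
it uses newer Java syntax than Java 5 for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: builds one application-level document from stored field values.
interface DocumentFactory<D> {
    D create(Map<String, String> storedFields);
}

// Hypothetical stand-in for a parameterized IndexReader<ResultDocument>.
// Stored fields are kept in memory; a real reader would read them from the index.
final class TypedReader<D> {
    private final List<Map<String, String>> stored = new ArrayList<>();
    private final DocumentFactory<D> factory;

    TypedReader(DocumentFactory<D> factory) {
        this.factory = factory;
    }

    // Stands in for indexing; returns the new document number.
    int add(Map<String, String> fields) {
        stored.add(new LinkedHashMap<>(fields));
        return stored.size() - 1;
    }

    // Typed counterpart of document(int n): the caller needs no cast.
    D document(int n) {
        return factory.create(stored.get(n));
    }
}

// Example application document (assumed field names: "title", "author").
final class MyAppDocument {
    final String title;
    final String author;

    MyAppDocument(String title, String author) {
        this.title = title;
        this.author = author;
    }
}

final class DocumentFactorySketch {
    static MyAppDocument demo() {
        TypedReader<MyAppDocument> reader = new TypedReader<>(
            fields -> new MyAppDocument(fields.get("title"), fields.get("author")));

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Hello");
        fields.put("author", "Nicolas");
        int n = reader.add(fields);

        // The result is already a MyAppDocument.
        return reader.document(n);
    }

    public static void main(String[] args) {
        MyAppDocument doc = demo();
        System.out.println(doc.title + " by " + doc.author);
    }
}
```

The factory is the only place that knows how to turn stored fields into an 
application type, so the reader itself stays generic.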

I also tried to go even further in decoupling indexing/searching from 
storing/extracting. On one hand, specify what to index and how, using the 
current Document design with Field. On the other hand, specify what to store 
and how, which would allow storing it in a DB. So adding a document to the 
index means creating a Document with only indexed fields, plus some document 
data, not necessarily organized by fields. The DocumentWriter then indexes 
the fields, as it does today, and stores the document data with a provided 
implementation of a DocumentDataStorage. In the other direction, extracting a 
document from an index in fact extracts only the document data, using the 
same implementation of DocumentDataStorage.
Then I realized that Lucene already allows this. With such a design, Lucene 
would have to keep an internal mapping between the document id and a document 
data id provided by the DocumentDataStorage. And in fact, with the current 
Lucene, this is simply a special stored field added to the document. The only 
advantage of such a design is that Lucene would provide very flexible tools 
to store data. It would allow two different merge policies, one for index 
segments and one for store segments; so the merge policy would be extracted 
from the IndexWriter, the segment notion would be abstracted, and so on. But I 
don't think this is the goal of Lucene, which is indexing and searching. (or 
maybe for a Lucene 3, 4 ? %) )
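
As a rough illustration of the "special stored field" observation, the sketch 
below keeps only a data id per document number and delegates the actual data 
to a storage object. Everything in it is an assumption made for the example: 
the DocumentDataStorage interface shape, the in-memory implementation, the 
DataIdIndex class and the "__dataId" field name. Indexed fields are left out 
entirely, and no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical storage abstraction: keeps document data outside the index.
interface DocumentDataStorage<T> {
    String store(T data);   // returns the id under which the data was stored
    T load(String dataId);  // returns the data for that id, or null if unknown
}

// Toy implementation; a DB-backed one would expose the same two methods.
final class InMemoryDataStorage implements DocumentDataStorage<String> {
    private final Map<String, String> rows = new HashMap<>();
    private int next = 0;

    public String store(String data) {
        String id = "data-" + next++;
        rows.put(id, data);
        return id;
    }

    public String load(String dataId) {
        return rows.get(dataId);
    }
}

// Stand-in for the index side: per document number, the only stored field
// is the data id handed back by the storage.
final class DataIdIndex<T> {
    static final String DATA_ID_FIELD = "__dataId"; // hypothetical field name

    private final List<Map<String, String>> storedFields = new ArrayList<>();
    private final DocumentDataStorage<T> storage;

    DataIdIndex(DocumentDataStorage<T> storage) {
        this.storage = storage;
    }

    // Store the data externally, remember its id, return the document number.
    int addDocument(T data) {
        Map<String, String> fields = new HashMap<>();
        fields.put(DATA_ID_FIELD, storage.store(data));
        storedFields.add(fields);
        return storedFields.size() - 1;
    }

    // Extraction: read the data id field, then ask the same storage for the data.
    T documentData(int n) {
        return storage.load(storedFields.get(n).get(DATA_ID_FIELD));
    }
}
```

The mapping from document number to data id is just one ordinary stored 
value, which is why nothing new is needed inside the index for this part.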

BTW, providing customized implementations of Document would be cool. In my 
application, I have just written a wrapper, which is simply instantiated with 
a special constructor: MyAppDocument(Document doc).
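
The wrapper approach could look roughly like the sketch below. To keep it 
runnable without Lucene on the classpath, StoredDocument is a hypothetical 
stand-in for Lucene's Document reduced to a string lookup by field name, and 
WrappedAppDocument plays the role of MyAppDocument; the "title" and "year" 
field names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Lucene's Document, reduced to name/value lookups.
final class StoredDocument {
    private final Map<String, String> fields = new HashMap<>();

    void add(String name, String value) {
        fields.put(name, value);
    }

    // Returns the stored value, or null when the field is absent.
    String get(String name) {
        return fields.get(name);
    }
}

// Application-level wrapper created from an extracted document.
final class WrappedAppDocument {
    private final StoredDocument doc;

    WrappedAppDocument(StoredDocument doc) {
        this.doc = doc;
    }

    String getTitle() {
        return doc.get("title");
    }

    // Typed accessor: the wrapper hides string parsing from callers.
    int getYear() {
        return Integer.parseInt(doc.get("year"));
    }
}
```

The cost of this approach is one wrapper allocation per extracted document 
and an explicit wrapping step at every call site.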

For LUCENE-662, I have tried to make it aware of Java 5 generic types. I have 
not proposed a patch because Lucene doesn't yet support Java 5. If people are 
interested, just to see how it would look, I can finish cleaning it up and 
publish it in Jira.

Nicolas

On Friday 19 January 2007 23:04, Grant Ingersoll wrote:
> Yes, duh.  Was writing and not thinking!
>
> On Jan 19, 2007, at 3:49 PM, Chris Hostetter wrote:
> > : Yes, I was suggesting this in light of your suggestions :-)  Document
> > : would have to be non-final for this to work.
> >
> > No ... Document as it is, with all of its methods for both being indexed
> > and for being returned from a search, could still be final -- it would
> > just need to implement these new interfaces.  The key would be having new
> > methods in IndexReader/IndexWriter/IndexSearcher that used these new
> > methods.
> >
> > -Hoss
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
