On Dec 1, 2010, at 8:07 AM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 8:01 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> 
>> While we are at it, how about we make the analysis process document-aware 
>> instead of Field-aware?  The PerFieldAnalyzerWrapper, while doing exactly 
>> what it says it does, is just silly.  If you had an analysis process that 
>> could be aware, when it chooses to be, of the document as a whole, you 
>> would open up a whole lot more opportunity for interesting analysis while 
>> losing nothing in the individual treatment of fields.  The TeeSink stuff 
>> is an attempt at this, but it is not sufficient.
>> 
> 
> I'm not sure I like this: traditionally we let the user application
> deal with "document parsing" (how you take your content and define
> it as documents/fields).

Nah, I just meant analysis would often benefit from having knowledge of the 
document as a whole instead of just the individual field.  

> 
> If we want to change Lucene to start dealing with this "document
> parsing" aspect, that's pretty scary in itself, but in my opinion the
> very last place we would want to add something like that is
> analysis! So personally I really like analysis being separate from
> document parsing: our analysis API is already way too complicated.

Yes, I agree.


> 
> Maybe if you give a concrete example then I would have a better
> understanding of the problem you think this might solve.

Let me see if I can put some flesh on the bones.  I'm assuming the raw document 
has already been parsed, that we are still basically dealing with strings, and 
that we have a document containing one or more fields.

If we step back and look at our analysis process, there are some things that 
are easy and some things that are hard that maybe shouldn't be, because even 
though we talk as if we are indexing and searching documents, we are really 
indexing and searching fields; everything is Field-centric.  That works fine 
for the easy analysis tasks like tokenization, stemming, lowercasing, etc., 
when all the content is in one language.  It doesn't work well when you have 
multiple languages in a single document, or when you want to do things like 
Tee/Sink, or even something as simple as Solr's copyField semantics.  The fact 
that we have PerFieldAnalyzerWrapper is a symptom of this; the clunkiness of 
TeeSinkTokenFilter is another; handling automatic language identification is 
another.  The end result is that you often have to do analysis work twice (or 
more) for the same piece of content.  An analysis process that knew a document 
has multiple fields (which seems like a given) could be more efficient: 
repeated analysis work could be shared, and work that inherently crosses 
multiple fields on the same document, or selects a particular field out of 
several, could be handled more cleanly.
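
To make the duplicated-work point concrete, here's roughly what it looks like 
today (3.x-style API; the field names are made up, and I'm assuming a writer 
and the raw title/text strings are already in scope):

    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
    analyzer.addAnalyzer("title", new KeywordAnalyzer());

    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
    // copyField-style duplicate: the identical text is tokenized all over
    // again, because each field runs its own independent chain
    doc.add(new Field("body_exact", text, Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);

And the Tee/Sink workaround for sharing one tokenization pass across two 
fields shows the clunkiness: you hand-build the stream yourself and have to 
add the fields in the right order so the tee is consumed before its sink:

    Document doc2 = new Document();
    TeeSinkTokenFilter tee = new TeeSinkTokenFilter(
        analyzer.tokenStream("body", new StringReader(text)));
    TokenStream sink = tee.newSinkTokenStream();
    doc2.add(new Field("body", tee));       // must be consumed first
    doc2.add(new Field("body_copy", sink)); // replays the cached tokens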

So you, as the developer, would still need to define what your fields are and 
what analysis you want done for each of them, but we, as Lucene developers, 
might be able to make things more efficient if we can recognize commonalities, 
and we could offer users tools that make it easy to work across fields.
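
Just to illustrate the shape of the idea (this is emphatically not a proposal, 
and none of these types exist), a document-aware entry point might look 
something like:

    /** Hypothetical: sees the whole Document instead of one field at a time. */
    public interface DocumentAwareAnalyzer {

      /**
       * One pass over the document: detect the language once, tokenize
       * shared text once, then emit tokens into whichever fields apply.
       */
      void analyze(Document doc, TokenSink sink);

      /** Hypothetical callback for routing tokens to fields. */
      interface TokenSink {
        void add(String field, AttributeSource token);
      }
    }

With something like that, a copyField or a Tee/Sink becomes "emit the same 
tokens into two fields" rather than "run the whole chain twice".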

At any rate, this is all just food for thought.  I don't have any proposed API 
changes at this point.