Uwe,
Is this example available? I think an example like this would help
the user community see the real value of the change. At least, I'd
love to see the code for it.
-- DM
On 08/10/2009 06:49 PM, Uwe Schindler wrote:
> UIMA....
The new API looks like UIMA: you have streams that are annotated with
various attributes that can be exchanged between
TokenStreams/TokenFilters, just like the current FlagsAttribute or
TypeAttribute, which can easily be misused for such things.
About a real use case for the new API:

I talked some time ago with Grant in the podcast about NumericRange
and the Publishing Network for Geoscientific Data called PANGAEA. At
the end of the talk (available on the Lucid Imagination website),
there were some explanations of how we index our XML documents, so
that one can search the contents of a specific XML element (the
element name is the field name) or use an XPath-like path as the
field name. E.g. take an XML document like this:
http://www.pangaea.de/PHP/getxml.php/51675 (please note: this is just
a very simple XML schema we use for indexing our documents). When we
index this document type into Lucene, we create a new field for each
element name, e.g. "lastName", "firstName" and so on. One can easily
search for any document where a specific "lastName" appears anywhere
(not only in the citation). We also create fields for more general
element names, so you can also look inside the field name "citation"
to search anywhere in the citation. You can also combine them, to
find only documents where the "lastName" of an "author" is "Xyz", by
using the field name "author:lastName".

In the past (before the new API), this analyzer was written in a very
complicated way: I created a StringBuffer for each element name,
appended the text to it, and then analyzed it again for each field
name.
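
Roughly, the field names for a token could be derived from its element
path like this (just an illustrative sketch with made-up names, not our
actual indexing code):

// Hypothetical sketch: derive Lucene field names from a stack of XML element
// names, e.g. [citation, author, lastName] -> "citation", "author", "lastName"
// plus the combined paths "author:lastName" and "citation:author:lastName".
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class ElementPathFieldNames {

  public static Set<String> fieldNamesFor(List<String> elementPath) {
    Set<String> names = new LinkedHashSet<String>();
    if (elementPath.isEmpty()) return names;
    // every element name on the path is searchable as a field of its own
    names.addAll(elementPath);
    // plus the XPath-like combinations ending at the innermost element
    StringBuilder path = new StringBuilder(elementPath.get(elementPath.size() - 1));
    for (int i = elementPath.size() - 2; i >= 0; i--) {
      path.insert(0, elementPath.get(i) + ":");
      names.add(path.toString());
    }
    return names;
  }
}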
Now I pass the XML document into my special XMLTokenStream that uses
STAX/DOM to retrieve the element names and contents. For each element
it produces a new token whose TermAttribute holds the whole element
contents as one term, together with a custom Attribute holding a
reference to the current element name and all enclosing element names
(the Attribute contains a Stack of element names). This special
Attribute travels through the whole tokenizer chain but is only
updated by the root XMLTokenStream. The next filter in the chain is a
whitespace filter (that splits up the tokens at whitespace), and so
on, to further tokenize the element contents. The special
element-name-stack attribute is untouched, but always contains the
current element name for later filtering.

The last step is using the new TeeSinkTokenFilter to index the stream
into different fields. The TeeSinkTokenFilter gets a Sink for each
field name / element name hierarchy (which are recorded beforehand);
each Sink filters the tokens using the special element stack
attribute and accepts only the tokens the field is interested in.
This way I can analyze the whole XML document only once and
distribute the contents to the various field names using the
additional attribute.
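
To give an idea what this looks like with the Lucene 2.9 attribute
API, here is a simplified sketch of such an element-stack Attribute
and a matching TeeSinkTokenFilter.SinkFilter. The class names
(XmlPathAttribute, ElementPathSinkFilter) and the exact matching rule
are made up for illustration; the real code is more involved:

// XmlPathAttribute.java -- custom attribute: the stack of enclosing XML
// element names for the current token.
import java.util.List;
import org.apache.lucene.util.Attribute;

public interface XmlPathAttribute extends Attribute {
  List<String> getPath();
  void setPath(List<String> path);
}

// XmlPathAttributeImpl.java -- implementation, found by AttributeSource via
// the "Impl" naming convention.
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.AttributeImpl;

public class XmlPathAttributeImpl extends AttributeImpl implements XmlPathAttribute {
  private List<String> path = new ArrayList<String>();

  public List<String> getPath() { return path; }
  public void setPath(List<String> path) { this.path = path; }

  @Override
  public void clear() { path = new ArrayList<String>(); }

  @Override
  public void copyTo(AttributeImpl target) {
    ((XmlPathAttribute) target).setPath(new ArrayList<String>(path));
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof XmlPathAttributeImpl
        && path.equals(((XmlPathAttributeImpl) other).path);
  }

  @Override
  public int hashCode() { return path.hashCode(); }
}

// ElementPathSinkFilter.java -- sink filter that only accepts tokens whose
// element path ends with the wanted path, e.g. [author, lastName] for the
// field "author:lastName".
import java.util.List;
import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.util.AttributeSource;

public class ElementPathSinkFilter extends TeeSinkTokenFilter.SinkFilter {
  private final List<String> wantedPath;

  public ElementPathSinkFilter(List<String> wantedPath) {
    this.wantedPath = wantedPath;
  }

  @Override
  public boolean accept(AttributeSource source) {
    XmlPathAttribute att = (XmlPathAttribute) source.addAttribute(XmlPathAttribute.class);
    List<String> path = att.getPath();
    int offset = path.size() - wantedPath.size();
    return offset >= 0 && path.subList(offset, path.size()).equals(wantedPath);
  }
}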
Here is an example (using the above schema) that finds all documents
where the publication to which the dataset is attached as a
supplement has "Evidence from Fram Strait" in its title:
http://www.pangaea.de/search?q=supplementTo%3Atitle%3A%22Evidence+from+Fram+Strait%22
(which hits only the example document above). The query parser is a
customized one (not the Lucene query parser).
The final code of this TokenStream is a little bit more complicated
than described here, but it shows a possible use of the new API:
annotate tokens with field identifiers to, e.g., automatically put
the title of a document into a title field, the authors into another
one, and so on.
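
The wiring into a Document then looks roughly like this (another
hedged sketch: the field names, the catch-all "fulltext" field and
the xmlTokenStream() placeholder are just for illustration):

// Sketch of the last step: one tee over the analyzed XML stream, one sink per
// field, everything added to the same Document.
import java.io.Reader;
import java.util.Arrays;
import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class XmlDocumentBuilder {

  public Document build(Reader xml) {
    TokenStream analyzed = xmlTokenStream(xml); // XMLTokenStream + whitespace filter etc.
    TeeSinkTokenFilter tee = new TeeSinkTokenFilter(analyzed);

    // one sink per field name / element path recorded beforehand
    TokenStream lastName = tee.newSinkTokenStream(
        new ElementPathSinkFilter(Arrays.asList("lastName")));
    TokenStream authorLastName = tee.newSinkTokenStream(
        new ElementPathSinkFilter(Arrays.asList("author", "lastName")));

    Document doc = new Document();
    // the tee has to be consumed before its sinks (alternatively call
    // tee.consumeAllTokens() up front), so it is added as a field first
    doc.add(new Field("fulltext", tee));
    doc.add(new Field("lastName", lastName));
    doc.add(new Field("author:lastName", authorLastName));
    return doc;
  }

  private TokenStream xmlTokenStream(Reader xml) {
    throw new UnsupportedOperationException("placeholder for the real XMLTokenStream chain");
  }
}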
I hope somebody understood what we are doing here :-)
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
------------------------------------------------------------------------
*From:* Shai Erera [mailto:ser...@gmail.com]
*Sent:* Monday, August 10, 2009 11:13 PM
*To:* java-dev@lucene.apache.org
*Subject:* Re: who clears attributes?
It sounds like the 'old' API should stay a bit longer than 3.0. We'd
like to give more people a chance to experiment w/ the new API before
we claim it is the new Analysis API in Lucene. And that means that
more users will have to live w/ the "bit of slowness" for longer than
is assumed in this thread.
I personally worry a lot about needing to throw away the current API.
I'll have a lot of code to port over and I haven't read anything so
far that convinces me the new API is better. I don't have any problems
w/ the current API today. I feel I have all the flexibility I need w/
indexing fields. I use payloads, Field.Index constants, write
Analyzers, TokenStreams ... actually I have 0 complaints.
Maybe we should follow what I seem to read from Earwin and Grant -
come up w/ real use cases, try to implement them w/ the current API,
and then, if it's impossible, discuss how we can make the current API
more adaptive. If at the end of this we get back to the new API,
we'll at least feel better about it, and be more convinced it is the
way to go.
Heck .. maybe we'll be convinced to base the Lucene analysis on UIMA? :)
Shai
On Mon, Aug 10, 2009 at 11:54 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >> I have serious doubts about releasing this new API until these
> >> performance issues are resolved and better proven out from a
> >> usability standpoint.
> >
> > I think LUCENE-1796 has fixed the performance problems, which was
> > caused by a missing reflection-cache needed for bw compatibility.
> > I hope to commit soon!
> >
> > 2.9 may be a little bit slower when you mix old and new API and do
> > not reuse Tokenizers (but Robert is already adding reusableTokenStream
> > to all contrib analyzers). When the backwards layer is removed
> > completely or setOnlyUseNewAPI is enabled, there is no speed impact at all.
> >
>
>
> The Analysis features of Lucene are the single most common place where
> people enhance Lucene. Very few add queries, or muck with field
> caches, but they do write their own Analyzers and TokenStreams,
> etc. Within that, mixing old and new is likely the most common case
> for everyone who has made their own customizations, so a "little bit
> slower" is something I'd rather not live with just for the sake of
> some supposed goodness in a year or two.
But because of this flexibility, we added the backwards layer. The
old style with setUseNewAPI was not flexible at all, and nobody would
have moved his Tokenizers to the new API without that flexibility
(maybe because he uses external analyzer packages that are not yet
updated).
With "a little bit" I mean the cost of wrapping the old and new API is
really minimal, it is just an if statement and a method call, hopefully
optimized away by the JVM. In my tests the standard deviation between
different test runs was much higher than the difference between mixing
old/new API (on Win32), so it is not really sure, that the cost comes from
the delegation.
The only case that is really slower is the (now minimized) cost of
creation in TokenStream.<init> if you do not reuse TokenStreams: two
LinkedHashMaps have to be created and set up. But this is not caused
by the backwards layer.
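
Just as a side note, the reuse pattern in an Analyzer looks roughly
like this (a simplified sketch; the WhitespaceTokenizer only stands
in for a real analysis chain):

// Sketch of an Analyzer that reuses its TokenStream per thread via
// reusableTokenStream(), so TokenStream.<init> (and the two LinkedHashMaps)
// is not paid again for every field/document.
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class ReusingWhitespaceAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WhitespaceTokenizer(reader);
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
    if (tokenizer == null) {
      // first use on this thread: create and remember the stream
      tokenizer = new WhitespaceTokenizer(reader);
      setPreviousTokenStream(tokenizer);
    } else {
      // later uses: just point the existing stream at the new Reader
      tokenizer.reset(reader);
    }
    return tokenizer;
  }
}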
Uwe
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org