Uwe,

Is this example available? I think that an example like this would help the user community see the value in the change. At least, I'd love to see the code for it.

-- DM

On 08/10/2009 06:49 PM, Uwe Schindler wrote:

> UIMA....

The new API looks like UIMA: you have streams that are annotated with various attributes that can be exchanged between TokenStreams/TokenFilters, just like the current FlagsAttribute or TypeAttribute, which can easily be misused for such things.

About a real use case for the new API:

I talked some time ago with Grant in the podcast about NumericRange and the Publishing Network for Geoscientific Data called PANGAEA. At the end of the talk (available on the Lucid Imagination website), there were some explanations of how we index our XML documents so that one can ask for the contents of a specific XML element name (the element name is the field name) or an XPath-like path as field name. E.g. take an XML document like this: http://www.pangaea.de/PHP/getxml.php/51675 (please note: this is just a very simple XML schema we use for indexing our documents). When we index this document type into Lucene, we create a new field for each element name, e.g. "lastName", "firstName" and so on. One can easily search for any document where a specific "lastName" appears anywhere (not only in the citation). We also create fields for more general element names, so you could also look inside the field name "citation" to search anywhere in the citation. You could also combine both, to only find documents where the "lastName" of an "author" is "Xyz", by using the field name "author:lastName". In the past (before the new API), I wrote this analyzer in a rather complicated way: I created StringBuffers for each element name, appended the text to them, and then analyzed the text again for each field name.
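Roughly, the old approach looked like this (a heavily simplified sketch, not the real PANGAEA code; all names are illustrative): the text is buffered once per field name while walking the XML, and every buffer is then analyzed again as its own field, so the same content gets tokenized once per field it appears under.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Hypothetical sketch of the old, pre-new-API approach: buffer text per field
// name (element name or colon-separated path), then analyze every buffer again.
public class OldStyleXmlIndexer {
  private final Map<String, StringBuffer> buffers = new LinkedHashMap<String, StringBuffer>();

  // called while walking the XML (STAX/DOM), once per matching element name / path
  void addText(String fieldName, String text) {
    StringBuffer buf = buffers.get(fieldName);
    if (buf == null) {
      buf = new StringBuffer();
      buffers.put(fieldName, buf);
    }
    buf.append(text).append(' ');
  }

  Document buildDocument() {
    Document doc = new Document();
    for (Map.Entry<String, StringBuffer> e : buffers.entrySet()) {
      // each field's buffer is tokenized separately -> the same content is
      // analyzed once per field name it appears under
      doc.add(new Field(e.getKey(), e.getValue().toString(),
                        Field.Store.NO, Field.Index.ANALYZED));
    }
    return doc;
  }

  public static void main(String[] args) {
    OldStyleXmlIndexer indexer = new OldStyleXmlIndexer();
    // e.g. for <citation><author><lastName>Xyz</lastName></author></citation>:
    indexer.addText("lastName", "Xyz");
    indexer.addText("author:lastName", "Xyz");
    indexer.addText("citation", "Xyz");
    System.out.println(indexer.buildDocument());
  }
}
```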

Now I pass the XML document into my special XMLTokenStream that uses STAX/DOM to retrieve the element names and contents. Each element creates a new TermAttribute (with the whole contents as one term) and a custom Attribute holding a reference to the current element name and all enclosing higher-level element names (the Attribute contains a Stack of element names). This special Attribute is then in the Tokenizer chain and is only updated by the root XMLTokenStream. The next filter in the chain is a WhitespaceFilter (that splits up the tokens at whitespace), and so on, to further tokenize the element contents. The special element-name-stack attribute stays untouched, but always contains the current element name for later filtering. The last step uses the new TeeSinkTokenFilter to index the stream into different fields. The TeeSinkTokenFilter gets Sinks for each field name / element name hierarchy (which are recorded before); each Sink filters the tokens using the special element-stack attribute, keeping only the tokens the field is interested in. That way I can analyze the whole XML document just once and distribute the contents to various field names using the additional attribute.
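In (heavily simplified) code, the interesting parts look roughly like this. This is not the real PANGAEA code: the XMLTokenStream and the whitespace filter are left out, and ElementPathAttribute together with its path-matching rule are only illustrative assumptions, but it shows how a custom attribute and the TeeSinkTokenFilter sinks can be wired together with the new API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource;

/** Custom attribute carrying the stack of enclosing XML element names. */
interface ElementPathAttribute extends Attribute {
  void setPath(List<String> path);   // e.g. [citation, author, lastName]
  boolean matches(String fieldName); // does e.g. "author:lastName" match the path?
}

/** Implementation; found by the default AttributeFactory via the "...Impl" naming convention. */
class ElementPathAttributeImpl extends AttributeImpl implements ElementPathAttribute {
  private List<String> path = new ArrayList<String>();

  public void setPath(List<String> path) {
    this.path = new ArrayList<String>(path); // defensive copy, so captured states stay stable
  }

  /** Assumed semantics: the colon-separated field name must occur as a contiguous run in the path. */
  public boolean matches(String fieldName) {
    String[] parts = fieldName.split(":");
    outer:
    for (int start = 0; start + parts.length <= path.size(); start++) {
      for (int i = 0; i < parts.length; i++)
        if (!parts[i].equals(path.get(start + i))) continue outer;
      return true;
    }
    return false;
  }

  public void clear() { path = new ArrayList<String>(); }

  public void copyTo(AttributeImpl target) { ((ElementPathAttribute) target).setPath(path); }

  public boolean equals(Object other) {
    return other instanceof ElementPathAttributeImpl
        && path.equals(((ElementPathAttributeImpl) other).path);
  }

  public int hashCode() { return path.hashCode(); }
}

/** SinkFilter that only accepts tokens whose element path matches one field name. */
class ElementPathSinkFilter extends TeeSinkTokenFilter.SinkFilter {
  private final String fieldName;
  ElementPathSinkFilter(String fieldName) { this.fieldName = fieldName; }
  public boolean accept(AttributeSource source) {
    return source.addAttribute(ElementPathAttribute.class).matches(fieldName);
  }
}

class XmlFieldDistributor {
  /**
   * xmlTokenStream is the chain described above (the custom XMLTokenStream plus the
   * whitespace-splitting filter, both not shown); it keeps the ElementPathAttribute
   * up to date for every token it emits.
   */
  static void addXmlFields(Document doc, TokenStream xmlTokenStream, String[] fieldNames)
      throws IOException {
    TeeSinkTokenFilter tee = new TeeSinkTokenFilter(xmlTokenStream);
    for (String fieldName : fieldNames) {
      doc.add(new Field(fieldName, tee.newSinkTokenStream(new ElementPathSinkFilter(fieldName))));
    }
    // consume the tee once: the document is analyzed a single time and every
    // sink records only the tokens whose element path it is interested in
    tee.consumeAllTokens();
  }
}
```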

Here is an example (using the above schema) that shows all documents with a title of "Evidence from Fram Strait" in the publication to which the dataset is attached as a supplement: http://www.pangaea.de/search?q=supplementTo%3Atitle%3A%22Evidence+from+Fram+Strait%22 (which hits only the above example). The query parser is customized (not the Lucene one).
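Only to illustrate the field-name scheme (our parser is custom, so this is not its code): in plain Lucene terms such a query roughly corresponds to a phrase query on the path-like field name, e.g.:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PathFieldQueryExample {
  public static void main(String[] args) {
    // "supplementTo:title" is the path-like field name created at index time;
    // the lowercased terms are an assumption about the analysis chain.
    PhraseQuery q = new PhraseQuery();
    q.add(new Term("supplementTo:title", "evidence"));
    q.add(new Term("supplementTo:title", "from"));
    q.add(new Term("supplementTo:title", "fram"));
    q.add(new Term("supplementTo:title", "strait"));
    System.out.println(q);
  }
}
```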

The final code of this TokenStream is a little bit more complicated than described here, but it shows a possible usage of the new API: annotate tokens with field identifiers to, e.g., automatically put the title of a document into a title field, the authors into another one, and so on.

I hope somebody understood what we are doing here :-)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

------------------------------------------------------------------------

*From:* Shai Erera [mailto:ser...@gmail.com]
*Sent:* Monday, August 10, 2009 11:13 PM
*To:* java-dev@lucene.apache.org
*Subject:* Re: who clears attributes?

It sounds like the 'old' API should stay a bit longer than 3.0. We'd like to give more people a chance to experiment w/ the new API before we claim it is the new Analysis API in Lucene. And that means that more users will have to live w/ the "bit of slowness" for longer than is assumed in this thread.

I personally worry a lot about needing to throw away the current API. I'll have a lot of code to port over, and I haven't read anything so far that convinces me the new API is better. I don't have any problems w/ the current API today. I feel I have all the flexibility I need w/ indexing fields. I use payloads, Field.Index constants, write Analyzers, TokenStreams ... actually I have 0 complaints.

Maybe we should follow what I seem to read from Earwin and Grant - come up w/ real use cases, try to implement them w/ the current API, and then, if it's impossible, discuss how we can make the current API more adaptive. If at the end of this we get back to the new API, then we'll at least feel better about it, and be more convinced it is the way to go.

Heck .. maybe we'll be convinced to base the Lucene analysis on UIMA? :)

Shai

On Mon, Aug 10, 2009 at 11:54 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> >> I have serious doubts about releasing this new API until these
> >> performance issues are resolved and better proven out from a
> >> usability
> >> standpoint.
> >
> > I think LUCENE-1796 has fixed the performance problems, which was
> > caused by
> > a missing reflection-cache needed for bw compatibility. I hope to
> > commit
> > soon!
> >
> > 2.9 may be a little bit slower when you mix old and new API and do
> > not reuse
> > Tokenizers (but Robert is already adding reusableTokenStream to all
> > contrib
> > analyzers). When the backwards layer is removed completely or
> > setOnlyUseNewAPI is enabled, there is no speed impact at all.
> >
>
>
> The Analysis features of Lucene are the single most common place where
> people enhance Lucene.  Very few add queries, or muck with field
> caches, but they do write their own Analyzers and TokenStreams,
> etc.    Within that, mixing old and new is likely the most common case
> for everyone who has made their own customizations, so a "little bit
> slower" is something I'd rather not live with just for the sake of
> some supposed goodness in a year or two.

But because of this flexibility, we added the backwards layer. The old style with setUseNewAPI was not flexible at all, and nobody would move their Tokenizers to the new API without that flexibility (maybe they use external analyzer packages not yet updated).

With "a little bit" I mean the cost of wrapping the old and new API is
really minimal, it is just an if statement and a method call, hopefully
optimized away by the JVM. In my tests the standard deviation between
different test runs was much higher than the difference between mixing
old/new API (on Win32), so it is not really sure, that the cost comes from
the delegation.

The only case that is really slower is the (now minimized) cost of creation in TokenStream.<init> if you do not reuse TokenStreams: two LinkedHashMaps have to be created and set up. But this is not caused by the backwards layer.
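That creation cost is exactly what reusing TokenStreams avoids. A minimal sketch of the usual reuse pattern in 2.9 (not from our code; the concrete tokenizer/filter choices are just placeholders): the per-thread chain, and with it the attribute maps created in TokenStream.<init>, is built only once and just reset for each new document:

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class MyReusableAnalyzer extends Analyzer {

  private static final class SavedStreams {
    WhitespaceTokenizer source;
    TokenStream result;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // non-reusing path: builds a new chain (and new attribute maps) every time
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }

  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      // first use on this thread: build the chain once and remember it
      streams = new SavedStreams();
      streams.source = new WhitespaceTokenizer(reader);
      streams.result = new LowerCaseFilter(streams.source);
      setPreviousTokenStream(streams);
    } else {
      // subsequent uses: only point the existing chain at the new content
      streams.source.reset(reader);
    }
    return streams.result;
  }
}
```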

Uwe



