Uwe,
Is this example available? I think an example like this would help
the user community see the real value of the change. At least, I'd
love to see the code for it.
-- DM
On 08/10/2009 06:49 PM, Uwe Schindler wrote:
> UIMA....
The new API looks like UIMA: you have streams that are annotated with
various attributes that can be exchanged between
TokenStreams/TokenFilters, just like the current FlagsAttribute or
TypeAttribute, which can easily be misused for such things.
About a real use case for the new API:

I talked some time ago with Grant in the podcast about NumericRange
and the Publishing Network for Geoscientific Data called PANGAEA. At
the end of the talk (available on the Lucid Imagination website),
there were some explanations of how we index our XML documents, so
that one can search the contents of a specific XML element (the
element name is the field name) or use an XPath-like path as the
field name. E.g. take an XML document like this:
http://www.pangaea.de/PHP/getxml.php/51675 (please note: this is just
a very simple XML schema we use for indexing our documents). When we
index this document type into Lucene, we create a new field for each
element name, e.g. "lastName", "firstName" and so on. One can easily
search for any document where a specific "lastName" appears anywhere
(not only in the citation). We also create fields for more general
element names, so you can also look inside the field name "citation"
to search anywhere in the citation. You can also combine them, to
find only documents where the "lastName" of an "author" is "Xyz", by
using the field name "author:lastName".

In the past (before the new API), this analyzer was written in a very
complicated way: I created a StringBuffer for each element name,
appended the text to it, and then analyzed it again for each field
name.
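
Roughly, the field names for a token could be derived from its element
path like this (just an illustrative sketch with made-up names, not our
actual indexing code):

// Hypothetical sketch: derive Lucene field names from a stack of XML element
// names, e.g. [citation, author, lastName] -> "citation", "author", "lastName"
// plus the combined paths "author:lastName" and "citation:author:lastName".
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class ElementPathFieldNames {

  public static Set<String> fieldNamesFor(List<String> elementPath) {
    Set<String> names = new LinkedHashSet<String>();
    if (elementPath.isEmpty()) return names;
    // every element name on the path is searchable as a field of its own
    names.addAll(elementPath);
    // plus the XPath-like combinations ending at the innermost element
    StringBuilder path = new StringBuilder(elementPath.get(elementPath.size() - 1));
    for (int i = elementPath.size() - 2; i >= 0; i--) {
      path.insert(0, elementPath.get(i) + ":");
      names.add(path.toString());
    }
    return names;
  }
}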
Now I pass the XML document into my special XMLTokenStream that uses
STAX/DOM to retrieve the element names and contents. For each element
it produces a new token whose TermAttribute holds the whole element
contents as one term, together with a custom Attribute holding a
reference to the current element name and all enclosing element names
(the Attribute contains a Stack of element names). This special
Attribute travels through the whole tokenizer chain but is only
updated by the root XMLTokenStream. The next filter in the chain is a
whitespace filter (that splits up the tokens at whitespace), and so
on, to further tokenize the element contents. The special
element-name-stack attribute is untouched, but always contains the
current element name for later filtering.

The last step is using the new TeeSinkTokenFilter to index the stream
into different fields. The TeeSinkTokenFilter gets a Sink for each
field name / element name hierarchy (which are recorded beforehand);
each Sink filters the tokens using the special element stack
attribute and accepts only the tokens the field is interested in.
This way I can analyze the whole XML document only once and
distribute the contents to the various field names using the
additional attribute.
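
To give an idea what this looks like with the Lucene 2.9 attribute
API, here is a simplified sketch of such an element-stack Attribute
and a matching TeeSinkTokenFilter.SinkFilter. The class names
(XmlPathAttribute, ElementPathSinkFilter) and the exact matching rule
are made up for illustration; the real code is more involved:

// XmlPathAttribute.java -- custom attribute: the stack of enclosing XML
// element names for the current token.
import java.util.List;
import org.apache.lucene.util.Attribute;

public interface XmlPathAttribute extends Attribute {
  List<String> getPath();
  void setPath(List<String> path);
}

// XmlPathAttributeImpl.java -- implementation, found by AttributeSource via
// the "Impl" naming convention.
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.AttributeImpl;

public class XmlPathAttributeImpl extends AttributeImpl implements XmlPathAttribute {
  private List<String> path = new ArrayList<String>();

  public List<String> getPath() { return path; }
  public void setPath(List<String> path) { this.path = path; }

  @Override
  public void clear() { path = new ArrayList<String>(); }

  @Override
  public void copyTo(AttributeImpl target) {
    ((XmlPathAttribute) target).setPath(new ArrayList<String>(path));
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof XmlPathAttributeImpl
        && path.equals(((XmlPathAttributeImpl) other).path);
  }

  @Override
  public int hashCode() { return path.hashCode(); }
}

// ElementPathSinkFilter.java -- sink filter that only accepts tokens whose
// element path ends with the wanted path, e.g. [author, lastName] for the
// field "author:lastName".
import java.util.List;
import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.util.AttributeSource;

public class ElementPathSinkFilter extends TeeSinkTokenFilter.SinkFilter {
  private final List<String> wantedPath;

  public ElementPathSinkFilter(List<String> wantedPath) {
    this.wantedPath = wantedPath;
  }

  @Override
  public boolean accept(AttributeSource source) {
    XmlPathAttribute att = (XmlPathAttribute) source.addAttribute(XmlPathAttribute.class);
    List<String> path = att.getPath();
    int offset = path.size() - wantedPath.size();
    return offset >= 0 && path.subList(offset, path.size()).equals(wantedPath);
  }
}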
Here is an example (using the above schema) that finds all documents
where the publication to which the dataset is attached as a
supplement has "Evidence from Fram Strait" in its title:
http://www.pangaea.de/search?q=supplementTo%3Atitle%3A%22Evidence+from+Fram+Strait%22
(which hits only the example document above). The query parser is a
customized one (not the Lucene query parser).
The final code of this TokenStream is a little bit more complicated
than described here, but it shows a possible use of the new API:
annotate tokens with field identifiers to, e.g., automatically put
the title of a document into a title field, the authors into another
one, and so on.
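
The wiring into a Document then looks roughly like this (another
hedged sketch: the field names, the catch-all "fulltext" field and
the xmlTokenStream() placeholder are just for illustration):

// Sketch of the last step: one tee over the analyzed XML stream, one sink per
// field, everything added to the same Document.
import java.io.Reader;
import java.util.Arrays;
import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class XmlDocumentBuilder {

  public Document build(Reader xml) {
    TokenStream analyzed = xmlTokenStream(xml); // XMLTokenStream + whitespace filter etc.
    TeeSinkTokenFilter tee = new TeeSinkTokenFilter(analyzed);

    // one sink per field name / element path recorded beforehand
    TokenStream lastName = tee.newSinkTokenStream(
        new ElementPathSinkFilter(Arrays.asList("lastName")));
    TokenStream authorLastName = tee.newSinkTokenStream(
        new ElementPathSinkFilter(Arrays.asList("author", "lastName")));

    Document doc = new Document();
    // the tee has to be consumed before its sinks (alternatively call
    // tee.consumeAllTokens() up front), so it is added as a field first
    doc.add(new Field("fulltext", tee));
    doc.add(new Field("lastName", lastName));
    doc.add(new Field("author:lastName", authorLastName));
    return doc;
  }

  private TokenStream xmlTokenStream(Reader xml) {
    throw new UnsupportedOperationException("placeholder for the real XMLTokenStream chain");
  }
}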
I hope somebody understood what we are doing here :-)
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
------------------------------------------------------------------------
*From:* Shai Erera [mailto:ser...@gmail.com]
*Sent:* Monday, August 10, 2009 11:13 PM
*To:* java-dev@lucene.apache.org
*Subject:* Re: who clears attributes?
It sounds like the 'old' API should stay a bit longer than 3.0. We'd
like to give more people a chance to experiment w/ the new API before
we claim it is the new Analysis API in Lucene. And that means that
more users will have to live w/ the "bit of slowness" for longer than
is assumed in this thread.
I personally worry a lot about needing to throw away the current API.
I'll have a lot of code to port over and I haven't read anything so
far that convinces me the new API is better. I don't have any problems
w/ the current API today. I feel I have all the flexibility I need w/
indexing fields. I use payloads, Field.Index constants, write
Analyzers, TokenStreams ... actually I have 0 complaints.
Maybe we should follow what I seem to read from Earwin and Grant -
come up w/ real use cases, try to implement them w/ the current API,
and then, if it's impossible, discuss how we can make the current API
more adaptive. If at the end of this we get back to the new API,
we'll at least feel better about it, and be more convinced it is the
way to go.
Heck .. maybe we'll be convinced to base the Lucene analysis on UIMA? :)
Shai
On Mon, Aug 10, 2009 at 11:54 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >> I have serious doubts about releasing this new API until these
> >> performance issues are resolved and better proven out from a
> >> usability standpoint.
> >
> > I think LUCENE-1796 has fixed the performance problems, which was
> > caused by a missing reflection-cache needed for bw compatibility.
> > I hope to commit soon!
> >
> > 2.9 may be a little bit slower when you mix old and new API and do
> > not reuse Tokenizers (but Robert is already adding reusableTokenStream
> > to all contrib analyzers). When the backwards layer is removed
> > completely or setOnlyUseNewAPI is enabled, there is no speed impact at all.
> >
>
>
> The Analysis features of Lucene are the single most common place where
> people enhance Lucene. Very few add queries, or muck with field
> caches, but they do write their own Analyzers and TokenStreams,
> etc. Within that, mixing old and new is likely the most common case
> for everyone who has made their own customizations, so a "little bit
> slower" is something I'd rather not live with just for the sake of
> some supposed goodness in a year or two.
But because of this flexibility, we added the backwards layer. The
old style with setUseNewAPI was not flexible at all, and nobody would
have moved his Tokenizers to the new API without that flexibility
(maybe because he uses external analyzer packages that are not yet
updated).
With "a little bit" I mean the cost of wrapping the old and new API is
really minimal, it is just an if statement and a method call, hopefully
optimized away by the JVM. In my tests the standard deviation between
different test runs was much higher than the difference between mixing
old/new API (on Win32), so it is not really sure, that the cost comes from
the delegation.
The only case that is really slower is the (now minimized) cost of
creation in TokenStream.<init> if you do not reuse TokenStreams: two
LinkedHashMaps have to be created and set up. But this is not caused
by the backwards layer.
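
Just as a side note, the reuse pattern in an Analyzer looks roughly
like this (a simplified sketch; the WhitespaceTokenizer only stands
in for a real analysis chain):

// Sketch of an Analyzer that reuses its TokenStream per thread via
// reusableTokenStream(), so TokenStream.<init> (and the two LinkedHashMaps)
// is not paid again for every field/document.
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class ReusingWhitespaceAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WhitespaceTokenizer(reader);
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
    if (tokenizer == null) {
      // first use on this thread: create and remember the stream
      tokenizer = new WhitespaceTokenizer(reader);
      setPreviousTokenStream(tokenizer);
    } else {
      // later uses: just point the existing stream at the new Reader
      tokenizer.reset(reader);
    }
    return tokenizer;
  }
}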
Uwe
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org