Re: Solr UIMA Notes

Lance Norskog Mon, 27 Aug 2012 15:59:01 -0700

I recently went through a similar exercise with adding a large suite
of stuff to Lucene & Solr. Strangely, it is for OpenNLP, another
natural-language-processing toolkit.


2) Test of Lucene code from Solr unit tests.
The problem is that the Lucene code requires a bunch of configuration.
To write a unit test directly in Lucene, you have to duplicate the
Solr factories. So, you test them through the Solr factories.

6) Memory size for unit tests
I added this to contrib/uima/build.xml. It throttles the
multi-threaded unit tests down to one thread. I also changed my unit
tests to flush out any cached data across tests.
 <property name="tests.jvms" value="1" />

7,9) Adding UIMA to standard example.
There are politics here about whether to have a big example or a small
example. I would have both. We should have a small example to
demonstrate the basics, useful as a starter kit, and also a giant
example with every last thing in the package. We only just discovered
that example/example-DIH did not work, because nobody tested it (or
reported the problem). We know that people will run the giant example
and find problems.
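
For the small example, the lib imports Tommaso mentions below could be as
little as the following in solrconfig.xml. This is only a sketch: the
relative paths and the jar name pattern are assumptions about the checkout
layout and have not been verified against trunk, so adjust them to wherever
contrib/uima and dist sit relative to the core.

    <!-- assumed paths: UIMA dependency jars, then the solr-uima contrib jar -->
    <lib dir="../../contrib/uima/lib" />
    <lib dir="../../dist/" regex="apache-solr-uima-\d.*\.jar" />

The AlchemyAPI engine and its key would stay out of the default config.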

On Mon, Aug 27, 2012 at 2:25 PM, Tommaso Teofili
<[email protected]> wrote:
> Hi Eric,
>
> 2012/8/10 Eric Pugh <[email protected]>
>>
>> Hi all,
>>
>> I've been working through the SolrUIMA demo, and have some changes to
>> propose based on going through it to make the UIMA stuff more accessible to
>> a new user.  Since JIRA is down, I thought I would email my notes to the
>> list and see if anyone can clarify my questions.
>>
>> Eric
>>
>>
>> 1) The class org.apache.lucene.analysis.uima.ae.OverridingParamsAEProvider
>> specifically mentions that it is used to take params supplied by Solr's
>> solrconfig.xml and feed them into the AnalysisEngine.  While no Solr imports
>> exist, so it could be used with anything, it seems odd that the phrasing for
>> a Lucene class refers to Solr.  Changing the phrasing from "injecting
>> runtime parameters defined in the solrconfig.xml Solr configuration file" to
>> "injecting runtime parameters such as those defined in the Solr
>> solrconfig.xml configuration file" might make the intent clearer and explain
>> why it isn't in a  Solr package, even though we have a Solr contrib module
>> for UIMA.
>
>
> yep, it's because those o.a.lucene.uima.ae classes were originally Solr
> "citizens". When we created the UIMA tokenizers we realized that it was
> good to have the factory classes available for both, so they were
> moved to lucene/analysis/uima. But you're right, the javadoc should be
> adjusted.
>
>>
>>
>> 2) The tests
>> org.apache.solr.uima.analysis.UIMAAnnotationsTokenizerFactoryTest and
>> UIMATypeAwareAnnotationsTokenizerFactoryTest test code that is in the
>> o.a.lucene structure, but with all the overhead of using Solr.  There is no
>> corresponding test in the o.a.lucene path for those factory classes.
>
>
> these two tests are explicitly for the Solr factories that are meant to be
> declared in a Solr schema. The tests in the lucene/analysis/uima module are
> UIMABaseAnalyzerTest (for the Analyzer built on UIMAAnnotationsTokenizer) and
> UIMATypeAwareAnalyzerTest (for the TypeAware related Analyzer).
>
>>
>>
>> 3) When going through the http://wiki.apache.org/solr/SolrUIMA/ tutorial,
>> it's very odd that you flip from the wiki page to content that is stored in
>> SVN and back as you follow the directions.  Especially since the bits of
>> sample config in SVN aren't used by tests or anything else.  I'd like to
>> move them to just the wiki, so they are easier to edit and keep up to date.
>
>
> +1
>
>>
>>
>> 4) When looking at the test files we have annotation engines with names
>> like "org.apache.solr.uima.ts.SentimentAnnotation".  However, they don't
>> exist as classes in the main source tree!  And when you go down the rabbit
>> hole, you eventually end up at a Java class called
>> org.apache.solr.uima.processor.an.DummySentimentAnnotator that actually is
>> the aforementioned annotator!  I'd like to change the test code so that we
>> actually are at least using something called
>> "org.apache.solr.uima.ts.DummySentimentAnnotation" or even
>> "org.apache.solr.uima.processor.an.DummySentimentAnnotator"!    I got very
>> excited that out of the box demo had sentiment analysis, and it really
>> didn't, just some mock code.
>
>
> maybe just changing SentimentAnnotation to DummySentimentAnnotation would
> make things more consistent and avoid confusion.
>
>>
>>
>> 5) It appears that when you pass a multivalued field through to UIMA, only
>> the first value is actually submitted to Solr.  If my XML (solr.xml from
>> example docs) looks like:
>>
>>   <field name="features">Advanced Full-Text Search Capabilities using Lucene</field>
>>   <field name="features">Optimized for High Volume Web Traffic</field>
>>
>> Then what gets processed is only the text "Advanced Full-Text Search
>> Capabilities using Lucene"!  I have a separate patch I will submit that uses
>> getFieldValues() instead of getFieldValue() method on a SolrInputDocument.
>
>
> this sounds like a bug, if you want to open a Jira issue / submit a patch
> you're more than welcome, otherwise I can do that.
>
>>
>>
>> 6) You need to bump your memory allocation!  -Xmx1024m -Xms512m, or it
>> WILL run out of heap space when running tests.
>
>
> I was not aware of that, I'll give it a try with a very small heap.
>
>>
>>
>> 7) I'd like to move the UIMA xml files etc into the /conf directory,
>> instead of accessing the files that are inside the JAR file.  Much easier to
>> hack on.  I copied solr/contrib/uima/src/resources/*.xml into
>> solr/example/solr/collection1/conf/uima, and access it via:
>>         <!--str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str-->
>>         <str name="analysisEngine">solr/${solr.core.instanceDir}/conf/uima/OverridingParamsExtServicesAE.xml</str>
>
>
> ok, sounds good even if the mentioned file is in
> src/org/apache/uima/desc/resources which can be edited easily for "playing"
> with the tests.
>
>>
>>
>> 8) It appears that for each annotation, I can only use the last "feature"
>> defined. This doesn't work:
>>           <lst name="type">
>>             <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
>>             <lst name="mapping">
>>               <str name="feature">language</str>
>>               <str name="field">language</str>
>>             </lst>
>>           </lst>
>>           <lst name="type">
>>             <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
>>             <lst name="mapping">
>>               <str name="feature">wikipedia</str>
>>               <str name="field">language_wikipedia</str>
>>             </lst>
>>           </lst>
>>
>>
>> Okay, figured it out finally,  and it has to look like this inside a type
>> definition:
>>             <lst name="mapping">
>>               <str name="feature">wikipedia</str>
>>               <str name="field">language_wikipedia</str>
>>             </lst>
>>             <lst name="mapping">
>>               <str name="feature">language</str>
>>               <str name="field">language</str>
>>             </lst>
>>             <lst name="mapping">
>>               <str name="feature">ethnologue</str>
>>               <str name="fieldNameFeature">language</str>
>>               <str name="dynamicField">*_sm</str>
>>             </lst>
>>
>
> sure the latter is how it's supposed to work, as features are related to one
> single type.
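
Assembled into one type definition, the working layout from item 8 would
look roughly like the sketch below. It is pieced together from the snippets
quoted above: the outer "type" wrapper comes from the non-working attempt
and the three mappings from the working one. It has not been run against a
live AlchemyAPI engine, so treat it as illustrative.

    <lst name="type">
      <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
      <!-- one "mapping" block per feature, all inside the same type -->
      <lst name="mapping">
        <str name="feature">wikipedia</str>
        <str name="field">language_wikipedia</str>
      </lst>
      <lst name="mapping">
        <str name="feature">language</str>
        <str name="field">language</str>
      </lst>
      <lst name="mapping">
        <str name="feature">ethnologue</str>
        <str name="fieldNameFeature">language</str>
        <str name="dynamicField">*_sm</str>
      </lst>
    </lst>

The type name appears once and each feature gets its own mapping. Repeating
the whole type block per feature, as in the first attempt, is what left only
the last feature in effect.
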
>
>>
>>
>>
>> 9) I'd like to patch the default solrconfig.xml to include the UIMA jars,
>> and move the config files over to /conf/uima, and then just comment out the
>> example.  Do we think that this is a good thing? Since you have to have an
>> AlchemyAPI key, we could just have the code do the sentence parsing as the
>> example, and comment out the alchemyAPI keys in solrconfig.xml.  Or, just
>> leave them in the source tree, and document the steps?
>
>
> I assume that just adding the elements for importing the libs could be ok,
> we should instead avoid adding the AlchemyAPI AE by default due to the key
> setting.
> I think the best option is to open separate Jira tickets for the above tasks
> and discuss them more deeply there.
> Thanks for your effort Eric.
>
> Regards,
> Tommaso
>
>>
>> -----------------------------------------------------
>> Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
>> http://www.opensourceconnections.com
>> Co-Author: Apache Solr 3 Enterprise Search Server available from
>> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless of
>> whether attachments are marked as such.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>



-- 
Lance Norskog
[email protected]
