I recently went through a similar exercise, adding a large suite of stuff to Lucene & Solr. Strangely enough, it was for OpenNLP, another natural-language-processing toolkit.

2) Test of Lucene code from Solr unit tests.
The problem is that the Lucene code requires a bunch of configuration. To write a unit test directly in Lucene, you have to duplicate the Solr factories. So, you test them through the Solr factories.

6) Memory size for unit tests.
I added this to contrib/uima/build.xml. It throttles the multi-threaded unit tests down to one thread. I also changed my unit tests to flush out any cached data across tests.

<property name="tests.jvms" value="1" />

7,9) Adding UIMA to standard example.
There is politics here, about whether to have a big example or a small example. I would use both. We should have a small example to demonstrate the basics, useful as a starter kit. And also have a giant example with every last thing in the package. We just now discovered that example/example-DIH did not work, because nobody tested it (or reported the problem). We know that people will run the giant example and find problems.

On Mon, Aug 27, 2012 at 2:25 PM, Tommaso Teofili <[email protected]> wrote:
> Hi Eric,
>
> 2012/8/10 Eric Pugh <[email protected]>
>>
>> Hi all,
>>
>> I've been working through the SolrUIMA demo, and have some changes to
>> propose based on going through it to make the UIMA stuff more accessible to
>> a new user. Since JIRA is down, I thought I would email my notes to the
>> list and see if anyone can clarify my questions.
>>
>> Eric
>>
>>
>> 1) The class org.apache.lucene.analysis.uima.ae.OverridingParamsAEProvider
>> specifically mentions that it is used to take params supplied by Solr's
>> solrconfig.xml and feed them into the AnalysisEngine. While no Solr imports
>> exist, so it could be used with anything, it seems odd that the phrasing for
>> a Lucene class refers to Solr.
>> Changing the phrasing from "injecting
>> runtime parameters defined in the solrconfig.xml Solr configuration file" to
>> "injecting runtime parameters such as those defined in the Solr
>> solrconfig.xml configuration file" might make the intent clearer and explain
>> why it isn't in a Solr package, even though we have a Solr contrib module
>> for UIMA.
>
>
> yep, it's due to the fact that those o.a.lucene.uima.ae classes were Solr
> "citizens" while when we created the UIMA tokenizers we realized that it was
> good to have the factory classes available for both therefore they were
> moved to lucene/analysis/uima but you're right the javadoc should be
> adjusted.
>
>>
>>
>> 2) The tests
>> org.apache.solr.uima.analysis.UIMAAnnotationsTokenizerFactoryTest and
>> UIMATypeAwareAnnotationsTokenizerFactoryTest test code that is in the
>> o.a.lucene structure, but with all the overhead of using Solr. There is no
>> corresponding test in the o.a.lucene path for those factory classes.
>
>
> these two tests are explicitly for the Solr factories that are meant to be
> declared in a Solr schema, the tests in the lucene/analysis/uima module are
> UIMABaseAnalyzerTest (for UIMAAnnotationsTokenizer generated Analyzer) and
> UIMATypeAwareAnalyzerTest (for the TypeAware related Analyzer).
>
>>
>>
>> 3) When going through the http://wiki.apache.org/solr/SolrUIMA/ tutorial,
>> it's very odd that you flip from the wiki page to content that is stored in
>> SVN and back as you follow the directions. Especially since the bits of
>> sample config in SVN aren't used by tests or anything else. I'd like to
>> move them to just the wiki, so they are easier to edit and keep up to date.
>
>
> +1
>
>>
>>
>> 4) When looking at the test files we have annotation engines with names
>> like "org.apache.solr.uima.ts.SentimentAnnotation". However, they don't
>> exist as classes in the main source tree!
>> And when you go down the rabbit
>> hole, you eventually end up at a Java class called
>> org.apache.solr.uima.processor.an.DummySentimentAnnotator that actually is
>> the aforementioned annotator! I'd like to change the test code so that we
>> actually are at least using something called
>> "org.apache.solr.uima.ts.DummySentimentAnnotation" or even
>> "org.apache.solr.uima.processor.an.DummySentimentAnnotator"! I got very
>> excited that out of the box demo had sentiment analysis, and it really
>> didn't, just some mock code.
>
>
> maybe just changing SentimentAnnotation to DummySentimentAnnotation would
> make things more consistent and avoid confusion.
>
>>
>>
>> 5) It appears that when you pass a multivalued field through to UIMA, only
>> the first value is actually submitted to Solr. If my XML (solr.xml from
>> example docs) looks like:
>>
>> <field name="features">Advanced Full-Text Search Capabilities using Lucene</field>
>> <field name="features">Optimized for High Volume Web Traffic</field>
>>
>> Then what gets processed is only the text "Advanced Full-Text Search
>> Capabilities using Lucene"! I have a separate patch I will submit that uses
>> getFieldValues() instead of getFieldValue() method on a SolrInputDocument.
>
>
> this sounds like a bug, if you want to open a Jira issue / submit a patch
> you're more than welcome, otherwise I can do that.
>
>>
>>
>> 6) You need to bump your memory allocation! -Xmx1024m -Xms512m, or it
>> WILL run out of heap space when running tests.
>
>
> I was not aware of that, I'll give it a try with a very small heap.
>
>>
>>
>> 7) I'd like to move the UIMA xml files etc into the /conf directory,
>> instead of accessing the files that are inside the JAR file. Much easier to
>> hack on.
>> I copied solr/contrib/uima/src/resources/*.xml into
>> solr/example/solr/collection1/conf/uima, and access it via:
>> <!--str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str-->
>> <str name="analysisEngine">solr/${solr.core.instanceDir}/conf/uima/OverridingParamsExtServicesAE.xml</str>
>
>
> ok, sounds good even if the mentioned file is in
> src/org/apache/uima/desc/resources which can be edited easily for "playing"
> with the tests.
>
>>
>>
>> 8) It appears like for each annotation, I can only use the last "feature"
>> defined. This doesn't work:
>>   <lst name="type">
>>     <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
>>     <lst name="mapping">
>>       <str name="feature">language</str>
>>       <str name="field">language</str>
>>     </lst>
>>   </lst>
>>   <lst name="type">
>>     <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
>>     <lst name="mapping">
>>       <str name="feature">wikipedia</str>
>>       <str name="field">language_wikipedia</str>
>>     </lst>
>>   </lst>
>>
>>
>> Okay, figured it out finally, and it has to look like this inside a type
>> definition:
>>   <lst name="mapping">
>>     <str name="feature">wikipedia</str>
>>     <str name="field">language_wikipedia</str>
>>   </lst>
>>   <lst name="mapping">
>>     <str name="feature">language</str>
>>     <str name="field">language</str>
>>   </lst>
>>   <lst name="mapping">
>>     <str name="feature">ethnologue</str>
>>     <str name="fieldNameFeature">language</str>
>>     <str name="dynamicField">*_sm</str>
>>   </lst>
>>
>
> sure the latter is how it's supposed to work, as features are related to one
> single type.
>
>>
>>
>>
>> 9) I'd like to patch the default solrconfig.xml to include the UIMA jars,
>> and move the config files over to /conf/uima, and then just comment out the
>> example. Do we think that this is a good thing?
>> Since you have to have an
>> AlchemyAPI key, we could just have the code do the sentence parsing as the
>> example, and comment out the alchemyAPI keys in solrconfig.xml. Or, just
>> leave them in the source tree, and document the steps?
>
>
> I assume that just adding the elements for importing the libs could be ok,
> we should instead avoid adding the AlchemyAPI AE by default due to the key
> setting.
> I think the best option is to open separate Jira tickets for the above tasks
> and discuss them more deeply there.
> Thanks for your effort Eric.
>
> Regards,
> Tommaso
>
>>
>> -----------------------------------------------------
>> Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
>> http://www.opensourceconnections.com
>> Co-Author: Apache Solr 3 Enterprise Search Server available from
>> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless of
>> whether attachments are marked as such.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

--
Lance Norskog
[email protected]
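
For reference, the working form from item 8 can be put together as one complete type definition. This is only a sketch assembled from the fragments quoted in the thread: the type name and the three mappings are Eric's, but wrapping them in a single <lst name="type"> with the <str name="name"> element first is an assumption taken from his first (non-working) snippet and has not been checked against the UIMA contrib itself.

```xml
<!-- Assumed layout: one "type" entry per annotation type, with every
     feature mapping for that type nested inside it. -->
<lst name="type">
  <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
  <lst name="mapping">
    <str name="feature">wikipedia</str>
    <str name="field">language_wikipedia</str>
  </lst>
  <lst name="mapping">
    <str name="feature">language</str>
    <str name="field">language</str>
  </lst>
  <lst name="mapping">
    <str name="feature">ethnologue</str>
    <str name="fieldNameFeature">language</str>
    <str name="dynamicField">*_sm</str>
  </lst>
</lst>
```

This matches Tommaso's explanation that features belong to one single type: repeating the same type in a second <lst name="type"> entry is the form Eric reported as not working, with only the last feature being used.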
