[ https://issues.apache.org/jira/browse/MAHOUT-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-675:
-----------------------------
Resolution: Fixed
Fix Version/s: 0.5
Status: Resolved (was: Patch Available)
I looked at this recently and agree this sounds like at least equally
reasonable behavior.
> LuceneIterator throws an IllegalStateException when a null TermFreqVector is
> encountered for a document instead of skipping to the next one
> -------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-675
> URL: https://issues.apache.org/jira/browse/MAHOUT-675
> Project: Mahout
> Issue Type: Improvement
> Components: Utils
> Reporter: Chris Jordan
> Fix For: 0.5
>
> Attachments: MAHOUT-675
>
>
> The org.apache.mahout.utils.vectors.lucene.LuceneIterator currently throws an
> IllegalStateException from its computeNext() method if it encounters a
> document with a null term frequency vector for the target field. That is
> problematic for people developing text mining applications on top of Lucene,
> as it forces them to check that the documents they add to their Lucene
> indexes actually have terms for the target field. While that check may sound
> reasonable, in practice it is not.
>
> Lucene in most cases will apply an analyzer to a field in a document as it is
> added to the index. The StandardAnalyzer is pretty lenient and removes very
> few terms. In most cases, though, if you want better text mining
> performance, you will create your own custom analyzer. For example, in my
> current work with document clustering, in order to generate tighter clusters
> and more human-readable top terms, I am using a stop word list specific
> to my subject domain and filtering out terms that contain numbers. The
> net result is that some of my documents have no terms for the target field,
> which is a desirable outcome. When I attempt to dump the Lucene vectors,
> though, I encounter an IllegalStateException because of those documents.
>
> It is possible for me to check the TokenStream of the target field before
> I insert into my index. However, if we were to follow that approach, I would
> have to perform this check in each of my applications. That isn't a
> great practice, as someone could be experimenting with custom analyzers to
> improve text mining performance and then encounter this exception without any
> real indication that it was due to the custom analyzer.
>
> I believe a better approach is to log a warning with the field id of the
> problem document and then skip to the next one. That way, a warning will be
> in the logs and the Lucene vector dump process will not halt.
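
For illustration only, the warn-and-skip behavior proposed in the description
could look roughly like the sketch below. It is not the attached patch and
does not use the Lucene or Mahout APIs: SkippingVectorIterator is a
hypothetical class, a plain int[] stands in for a TermFreqVector, a null list
entry stands in for a document with no terms in the target field, and
java.util.logging is assumed for the warning.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.logging.Logger;

// Hypothetical stand-in for an iterator over per-document term vectors.
// A null entry models a document whose target field produced no terms.
final class SkippingVectorIterator implements Iterator<int[]> {
    private static final Logger LOG =
        Logger.getLogger(SkippingVectorIterator.class.getName());

    private final List<int[]> termFreqVectors;
    private int nextDocId = 0;   // index of the next document to inspect
    private int[] nextVector;    // next non-null vector, or null when exhausted
    private int skippedCount = 0;

    SkippingVectorIterator(List<int[]> termFreqVectors) {
        this.termFreqVectors = termFreqVectors;
        advance();
    }

    // Move to the next document that has a vector, warning about and
    // skipping any document whose vector is null instead of throwing.
    private void advance() {
        nextVector = null;
        while (nextDocId < termFreqVectors.size()) {
            int docId = nextDocId++;
            int[] vector = termFreqVectors.get(docId);
            if (vector == null) {
                LOG.warning("Document " + docId
                    + " has no term vector for the target field; skipping");
                skippedCount++;
                continue;
            }
            nextVector = vector;
            return;
        }
    }

    @Override
    public boolean hasNext() {
        return nextVector != null;
    }

    @Override
    public int[] next() {
        if (nextVector == null) {
            throw new NoSuchElementException();
        }
        int[] result = nextVector;
        advance();
        return result;
    }

    int getSkippedCount() {
        return skippedCount;
    }

    public static void main(String[] args) {
        List<int[]> docs = Arrays.asList(new int[] {1, 2}, null, new int[] {3});
        SkippingVectorIterator it = new SkippingVectorIterator(docs);
        while (it.hasNext()) {
            System.out.println(Arrays.toString(it.next()));
        }
        System.out.println("skipped: " + it.getSkippedCount());
    }
}
```

With this shape, a document left empty by a custom analyzer costs one log line
rather than aborting the whole dump, and the skipped count gives callers a
cheap way to notice an analyzer that filters too aggressively.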
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira