LuceneIterator throws an IllegalStateException when a null TermFreqVector is 
encountered for a document instead of skipping to the next one
-------------------------------------------------------------------------------------------------------------------------------------------

                 Key: MAHOUT-675
                 URL: https://issues.apache.org/jira/browse/MAHOUT-675
             Project: Mahout
          Issue Type: Improvement
          Components: Utils
            Reporter: Chris Jordan


The org.apache.mahout.utils.vectors.lucene.LuceneIterator currently throws an 
IllegalStateException in its computeNext() method if it encounters a document 
with a null term frequency vector for the target field. That is problematic 
for people developing text mining applications on top of Lucene, because it 
forces them to verify that every document they add to their Lucene indexes 
actually has terms for the target field. While that check may sound 
reasonable, in practice it is not.

In most cases, Lucene applies an analyzer to a field as the document is added 
to the index. The StandardAnalyzer is fairly lenient and removes very few 
terms. If you want better text mining performance, though, you will usually 
create your own custom analyzer. For example, in my current work with document 
clustering, in order to generate tighter clusters and produce more human 
readable top terms, I am using a stop word list specific to my subject domain 
and I am filtering out terms that contain numbers. The net result is that some 
of my documents have no terms for the target field, which is a desirable 
outcome. When I attempt to dump the Lucene vectors, though, I encounter an 
IllegalStateException because of those documents.
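To make the "no terms left" case concrete, here is a minimal, stdlib-only 
sketch of the kind of filtering described above (it does not use Lucene's 
Analyzer/TokenFilter API; the stop words are hypothetical). A document whose 
tokens are all stop words or number-bearing terms ends up with an empty term 
list, and such a document yields a null term vector in the index:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class DomainFilter {

    // Hypothetical domain-specific stop words (stand-ins for a real list).
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("patient", "report"));

    // Matches any token containing a digit, e.g. "42mg".
    static final Pattern HAS_DIGIT = Pattern.compile(".*\\d.*");

    // Lower-cases tokens, then drops stop words and number-bearing terms.
    static List<String> filterTerms(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower) || HAS_DIGIT.matcher(lower).matches()) {
                continue; // filtered out, just as a custom analyzer would do
            }
            kept.add(lower);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Every token is filtered: this document contributes no terms,
        // so its term frequency vector for the field would be null.
        System.out.println(filterTerms(Arrays.asList("Patient", "42mg", "report")));
    }
}
```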

It is possible for me to check the TokenStream of the target field before 
inserting a document into my index; however, following that approach would 
mean performing the check in every one of my applications. That isn't a great 
practice: someone could be experimenting with custom analyzers to improve text 
mining performance and then hit this exception without any real indication 
that the custom analyzer was the cause.

I believe a better approach is to log a warning identifying the problem 
document and its field and then skip to the next one. That way the warning 
will be in the logs and the Lucene vector dump process will not halt.
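The proposed behavior could be sketched as below. This is a simplified, 
self-contained model, not Mahout's actual LuceneIterator (which wraps Lucene's 
IndexReader and Guava-style computeNext()); the class and method names here 
are illustrative only. The point is the control flow: warn and continue on a 
null vector instead of throwing:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SkipNullVectors {

    // Returns the next non-null "term vector", logging and skipping nulls.
    // Returns null only when the underlying iterator is exhausted.
    static String[] computeNext(Iterator<String[]> docs) {
        while (docs.hasNext()) {
            String[] termVector = docs.next();
            if (termVector == null) {
                // In LuceneIterator this would be a logger warning that
                // names the problem document, rather than an
                // IllegalStateException that halts the whole dump.
                System.err.println("WARN: null term vector for document; skipping");
                continue;
            }
            return termVector;
        }
        return null; // end of iteration
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"alpha", "beta"},
            null, // document whose analyzer removed every term
            new String[] {"gamma"});
        Iterator<String[]> it = docs.iterator();
        int yielded = 0;
        while (computeNext(it) != null) {
            yielded++;
        }
        System.out.println("vectors yielded: " + yielded); // 2 of 3 documents
    }
}
```

With this behavior, the two usable documents are still vectorized and the 
skipped one leaves a trace in the logs instead of aborting the run.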



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
