Its caused by a getQuick in the bowels of infer(). Adding the test to every access inside the loop seems more expensive than just catching the exception in the mapper.

On 5/23/10 3:29 PM, Sean Owen wrote:
Switching to dev list.

I don't want to belabor a small point, but I'm wondering whether it's
just better to check whatever array access causes the problem before
it's accessed?

Meaning...

if (foo>= array.length) {
   throw new IllegalStateException();
}
... array[foo] ...

instead of

try {
... array[foo] ...
} catch (ArrayIndexOutOfBoundsException aioobe) {
   throw new IllegalStateException();
}

On Sun, May 23, 2010 at 11:26 PM, Jeff Eastman
<[email protected]>  wrote:
I added a try block in the mapper which catches the exception and outputs:

java.lang.IllegalStateException: This is probably because the --numWords
argument is set too small.
    It needs to be>= than the number of words (terms actually) in the corpus
and can be
    larger if some storage inefficiency can be tolerated.
    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
    at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
    at
org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
    at
org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
    at
org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
    ... 5 more

I'll commit that for now while we explore a more elegant solution.


On 5/23/10 2:45 PM, Sean Owen wrote:
Even something as simple as checking that bound and throwing
IllegalStateException with a custom message -- yeah I imagine it's
hard to detect this anytime earlier. Just a thought.

On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
<[email protected]>    wrote:

I agree it is not very friendly. Impossible to tell the correct value in
the
options section processing. It needs to be>= than the actual number of
unique terms in the corpus and that is hard to anticipate though I think
it
is known in seq2sparse. If it turns out to be the dictionary size (I'm
investigating), then it could be computed by adding a dictionary path
argument instead of the current option. Trouble with that is the
dictionary
is not needed for anything else by LDA.





Reply via email to