[
https://issues.apache.org/jira/browse/MAHOUT-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396290#comment-14396290
]
Pat Ferrel commented on MAHOUT-1674:
------------------------------------
[[email protected]] will not be able to fix this until 0.10.1, so [~pferrel] is
looking for guidance on a short-term workaround.
This is hard to ignore because two users gathering data with Spark Streaming,
which tends to create lots of small files, have already run into the error.
Kafka (or another source) feeding Spark Streaming will be an increasingly
popular way to produce input for the cooccurrence calculation.
The only known workaround is to concatenate the input files before reading
them into Mahout. So far this has been verified in only one case.
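A minimal sketch of that workaround, assuming plain Spark is available and the
input is a directory of text part files; the paths and app name below are
hypothetical:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Merge a directory of small part files into a single file so that
// spark-itemsimilarity reads one input file and sees no empty partitions.
object ConcatParts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("concat-parts").setMaster("local"))
    sc.textFile("streaming-output")   // hypothetical input directory
      .coalesce(1)                    // collapse to a single partition
      .saveAsTextFile("merged-input") // writes a single part-00000
    sc.stop()
  }
}
{code}

On HDFS the same merge can be done without running a job via
hadoop fs -getmerge streaming-output merged.csv.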
> A'A fails with an index out of range for a row vector
> -----------------------------------------------------
>
> Key: MAHOUT-1674
> URL: https://issues.apache.org/jira/browse/MAHOUT-1674
> Project: Mahout
> Issue Type: Bug
> Components: s
> Affects Versions: 0.10.0
> Reporter: Pat Ferrel
> Assignee: Dmitriy Lyubimov
> Priority: Critical
> Fix For: 0.10.0
>
>
> A'A and possibly A'B can fail with an index out of bounds on the row vector.
> This seems related to partitioning: some partitions may be empty (a sketch at
> the end of this description illustrates how that happens).
> This can be reproduced with the attached data as input to
> spark-itemsimilarity. The data is A-only; the single large CSV completes
> correctly, but passing in the directory of part files exhibits the error. The
> data is identical in both cases except for the number of files used to
> contain it.
> The error occurs with the local raw filesystem and master = local, and is
> reached quickly.
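> A minimal sketch of how empty partitions arise, assuming plain Spark and a
> hypothetical input directory: Spark Streaming writes a part file for every
> batch, including batches that carried no data, and each empty file becomes
> an empty partition in the RDD read from the directory.
>
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
>
> // Count empty partitions in an RDD built from a directory of part files.
> object EmptyPartitionCheck {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(
>       new SparkConf().setAppName("empty-partition-check").setMaster("local"))
>     val sizes = sc.textFile("part-files-dir") // hypothetical directory
>       .mapPartitions(it => Iterator(it.size)) // records per partition
>       .collect()
>     println(s"partitions: ${sizes.length}, empty: ${sizes.count(_ == 0)}")
>     sc.stop()
>   }
> }
> {code}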
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)