This has come up before and can be a bit tricky to diagnose without looking
through the code carefully.

The basic problem is that something upstream is producing a long key where your
mapper expects an int.  The usual culprit is the input format: with the default
TextInputFormat, the map key is the byte offset of each line, which Hadoop hands
to the mapper as a LongWritable no matter what key type the mapper declares.
Have you posted your code as a patch on the JIRA, or is there a git link?
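
If it helps, here is a minimal stand-alone illustration (plain Java, no Hadoop
on the classpath; all class names below are made up for the demo) of why the
compiler cannot catch this: the mapper's generic parameters are erased, so the
framework's raw-typed call into map() only fails at runtime, when the actual
key object hits the cast inside a compiler-generated bridge method.

```java
// Stand-ins for the Hadoop Writable key types involved in the exception.
class LongW { long v; LongW(long v) { this.v = v; } }   // plays LongWritable
class IntW  { int v;  IntW(int v)  { this.v = v; } }    // plays IntWritable

abstract class Mapper<K, V> {
    abstract void map(K key, V value);

    // The framework calls map() without knowing K or V, much as Hadoop's
    // runner does; after erasure this cast is a no-op (Object to Object).
    @SuppressWarnings("unchecked")
    void run(Object key, Object value) {
        map((K) key, (V) value);
    }
}

public class CastDemo {
    static boolean triggers() {
        Mapper<IntW, String> mapper = new Mapper<IntW, String>() {
            @Override
            void map(IntW key, String value) {
                System.out.println(key.v + " -> " + value);
            }
        };
        try {
            // The input format delivers a long-keyed record, like
            // TextInputFormat delivering a LongWritable byte offset.
            mapper.run(new LongW(0L), "row data");
            return false;
        } catch (ClassCastException expected) {
            return true;   // LongW cannot be cast to IntW
        }
    }

    public static void main(String[] args) {
        System.out.println(triggers()
            ? "ClassCastException, just like the Eigencuts job"
            : "no exception");
    }
}
```

With the real classes, the standard fixes are to declare the mapper as
Mapper<LongWritable, Text, ...> when the input is plain text, or to read a
SequenceFile whose keys really are IntWritable.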

On Tue, Jun 15, 2010 at 9:55 PM, Shannon Quinn <[email protected]> wrote:

> Hi Ted,
>
> Thank you very much - very valuable insight as to a more robust input
> format. I've already started implementing it.
>
> I finished the new M/R process to reflect the new assumed input format
> (submitted the patch), but I'm getting an exception I can't seem to
> diagnose. When I start the program, and the INFO lines start rolling from
> the process, right before the M/R task begins I get the following:
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.io.IntWritable
>    at
> org.apache.mahout.clustering.eigencuts.EigencutsInputMapper.map(EigencutsInputMapper.java:22)
>
> The line 22 referred to in the message is:
>
> public class EigencutsInputMapper extends Mapper<IntWritable, Text,
> IntWritable, DistributedRowMatrix.MatrixEntryWritable> {
>
> I searched all my source files; there is no mention of LongWritable anywhere
> except one commented-out line. It was in my previous implementation, but I
> have run mvn clean multiple times since. Any thoughts would be appreciated.
>
> Thank you again!
>
> Regards,
> Shannon
>
>
> On 6/15/2010 7:03 PM, Ted Dunning wrote:
>
>> Shannon,
>>
>> Nice work so far.
>>
>> I think it is a bit more customary to enter a graph by giving the integer
>> pairs that represent the starting and ending nodes for each arc.  That
>> avoids the memory-allocation problem you hit when one node is connected to
>> millions of others.  It may also solve your DistributedRowMatrix problem,
>> since you could write a reducer that gathers everything to the right place
>> for writing a row; in doing that, you would inherently have the row number
>> available, because it would be the grouping key.
>>
>> If you keep the current one-matrix-row-per-CSV-line format, I would
>> recommend putting the source node at the beginning of each line.
>>
>>
>> On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn <[email protected]> wrote:
>>
>>
>>
>>> 1) I've made the assumption so far that the input to my clustering
>>> algorithm will be a single CSV file containing the entire affinity matrix,
>>> where each line in the file is a row in the matrix. Is there another
>>> input approach that would work better for reading this affinity matrix?
>>>
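
To make the arc-pair idea from my earlier message concrete, here is a rough
stand-alone sketch (plain Java rather than an actual Mahout reducer; the names
and the dense double[] row representation are illustrative only) of the
grouping step: edges arrive as (source, destination, weight) triples, and
grouping them by source yields one affinity-matrix row per key.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class EdgeListRows {
    // Assemble one dense row per source node from an edge list, the way a
    // reducer keyed on the source node would see its grouped values.
    static Map<Integer, double[]> buildRows(int n, int[][] srcDst, double[] w) {
        Map<Integer, double[]> rows = new TreeMap<>();
        for (int i = 0; i < w.length; i++) {
            int src = srcDst[i][0], dst = srcDst[i][1];
            rows.computeIfAbsent(src, k -> new double[n])[dst] = w[i];
        }
        return rows;   // key = row number, exactly the grouping key
    }

    public static void main(String[] args) {
        int[][] edges = { {0, 1}, {0, 2}, {1, 0} };
        double[] weights = { 0.5, 0.25, 0.5 };
        for (Map.Entry<Integer, double[]> r : buildRows(3, edges, weights).entrySet())
            System.out.println(r.getKey() + " -> " + Arrays.toString(r.getValue()));
        // prints:
        // 0 -> [0.0, 0.5, 0.25]
        // 1 -> [0.5, 0.0, 0.0]
    }
}
```

Nodes with no outgoing arcs simply produce no row, which is also why this
format avoids allocating anything proportional to the densest node up front.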
