Yeah, I know about that. But since Parse_Data already does that for us, I
did not want to do the same processing again. I will try to figure
something out. Thanks a lot.

Regards,
Ami Parikh
(213)590-0005

On Thu, Feb 26, 2015 at 12:39 PM, Renxia Wang <[email protected]> wrote:

> Not sure how you implemented it, so it is hard to tell. You may want to
> take a look at SegmentReader's get and getMapRecords methods; those may
> give you ideas. You can also call SegmentReader.get directly to get the
> segment data. It is slow, though, because it calls sleep(5000) every time
> you invoke it, so slow that you definitely will not have the result by
> tomorrow if you run it on your 50K-URL data set. Calling SegmentReader.get
> on all the segments at the same time from multiple threads can speed this
> up, but if you have a lot of segments (like me, > 20), you will run into
> OutOfMemory errors, even if you set the Java heap size to 4 GB (or even
> more). I am stuck here. T_T
>
> Zhique
>
>
>
> On Thu, Feb 26, 2015 at 11:54 AM, Ami Akshay Parikh <[email protected]>
> wrote:
>
>> I am using MapFile.Reader to iterate through the file. I read the key
>> into a Text object and the metadata into a ParseData object, and I get
>> the following exception:
>>
>> Exception in thread "main" java.io.EOFException
>> at java.io.DataInputStream.readFully(DataInputStream.java:197)
>> at org.apache.hadoop.io.Text.readString(Text.java:402)
>> at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
>> at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
>> at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
>> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
>> at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
>> at NearDuplicates.main(NearDuplicates.java:58)
>>
>> Thanks,
>>
>> Regards,
>> Ami Parikh
>> (213)590-0005
>>
>> On Thu, Feb 26, 2015 at 11:00 AM, Renxia Wang <[email protected]> wrote:
>>
>>> Hi Ami,
>>>
>>> Which method of which class do you use to get the metadata? Please
>>> provide more info about this, such as logs.
>>>
>>> Zhique
>>>
>>> On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> When I try to use the parse_data from the segment directory to get the
>>>> metadata for finding near duplicates, my code runs into an EOFException.
>>>> I found something about a bug in Nutch in the archives, but I wanted to
>>>> know whether anyone else is facing this problem and how I can possibly
>>>> resolve it.
>>>>
>>>> Thanks,
>>>>
>>>> Regards,
>>>> Ami Parikh
>>>> (213)590-0005
>>>>
>>>
>>>
>>
>
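
On the multi-threading point in the quoted reply above: one possible way to keep memory in check is to cap how many segments are read at once instead of starting a thread per segment. Below is a minimal sketch using only the JDK. BoundedSegmentRunner, runAll and perSegment are hypothetical names introduced for illustration; perSegment stands in for whatever actually reads a segment (for example a wrapper around SegmentReader.get), and the sketch does not call Nutch itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: run one task per segment, but only a few at a time. */
class BoundedSegmentRunner {

    /**
     * Applies perSegment to every segment path using at most maxParallel
     * worker threads and returns the results in input order.
     * perSegment is a hypothetical stand-in for the real per-segment read.
     */
    static <R> List<R> runAll(List<String> segments,
                              int maxParallel,
                              Function<String, R> perSegment) throws Exception {
        // A fixed pool caps how many segments are processed concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String segment : segments) {
                // Queued tasks are small; only running ones hold segment data.
                Callable<R> task = () -> perSegment.apply(segment);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                // get() blocks until that segment's task has finished.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call might look like runAll(segmentPaths, 4, path -> readSignature(path)), where readSignature is again hypothetical. This only helps if perSegment returns something small (a count, a hash, a signature) rather than the whole segment content, since all results are kept in the returned list.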
