Broken segments ?

Mischa Tuffield Thu, 26 Nov 2009 11:55:20 -0800

Hello All, 

I was wondering if there is any way to check the integrity of a segment. As it 
stands, I can't create the index I want because a number of my segments fail 
with errors like the ones below:


Is there any way to check whether my segments are OK? I guess I could always 
re-fetch them if need be.

Regards, and thanks in advance :)

Mischa


<!--
java.io.IOException: Could not obtain block: blk_8431627671702898365_95075 file=/user/nutch/crawl/segments/20091012145602/crawl_generate/part-00000
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:166)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:161)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)

...

java.io.IOException: Could not obtain block: blk_7970643458650610887_21674 file=/user/nutch/crawl/segments/20090618111426/content/part-00003/data
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat.getRecordReader(SegmentMerger.java:150)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
-->
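
As a first step towards seeing which segments are affected, the traces above can be scanned for their "Could not obtain block" lines. The following is only a rough sketch, not a Nutch or Hadoop tool: it assumes the task logs have been saved as plain text, that the message keeps the "blk_... file=..." shape shown above, and that segment directories are named with a 14-digit timestamp under a "segments" directory. The function name broken_segments is made up for illustration.

```python
import re
import sys

# Assumed message shape, taken from the traces in this post:
#   Could not obtain block: blk_<id> file=<hdfs path>
# \s+ lets the pattern match whether or not the message is wrapped
# onto a second line before "file=".
BLOCK_ERROR = re.compile(
    r"Could not obtain block:\s*(?P<block>blk_[-\d_]+)\s+file=(?P<path>\S+)"
)

# Assumption: segment directories are 14-digit timestamps under "segments/".
SEGMENT = re.compile(r"/segments/(?P<segment>\d{14})/")


def broken_segments(log_text):
    """Map each segment name to the set of (block, file) pairs that failed."""
    broken = {}
    for match in BLOCK_ERROR.finditer(log_text):
        path = match.group("path")
        seg = SEGMENT.search(path)
        if seg is None:
            continue  # the failing file is not inside a segment directory
        broken.setdefault(seg.group("segment"), set()).add(
            (match.group("block"), path)
        )
    return broken


if __name__ == "__main__":
    # Reads saved task logs from standard input and lists affected segments.
    for segment, failures in sorted(broken_segments(sys.stdin.read()).items()):
        print(segment)
        for block, path in sorted(failures):
            print("    %s  %s" % (block, path))
```

Run against the two traces above, this would list segments 20091012145602 and 20090618111426. It only reports what the logs already say. Since the exception is thrown by the HDFS client rather than by Nutch, running "hadoop fsck" on the segments path is probably the more direct way to find out whether blocks are actually missing.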


On 26 Nov 2009, at 12:03, Santiago Pérez wrote:

> 
> Hej,
> 
> I am new to Nutch and I need some help with a problem because I cannot
> find clear documentation.
> 
> During crawling, when each FetcherThread gets the content, it is
> formatted in a way that deletes the newline characters ("\n") and turns
> useful Spanish characters such as á,é,í,ó,ú,ñ,ü into sequences in the default
> encoding like: Ã?Â¡, Ã?Â³, Ã?Â , Ã?Â³, Ã?Âº, Ã?Â±, Ã?Â¼.
> 
> I would like to know if it is possible to change this default encoding (is
> it UTF-8?) to the one that I need (ASCII, I guess).
> 
> Thanks in advance ;)
> -- 
> View this message in context: 
> http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 

___________________________________
Mischa Tuffield
Email: [email protected]
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
