Hello All,
I was wondering whether there is any way to check the integrity of a segment. As it
stands, I can't create the index I want because a number of my segments fail
with "Could not obtain block" errors like the ones below.
Is there any way to check whether my segments are OK? I guess I could always re-fetch
them if need be.
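
For now, the best I can do is list which segments are affected by pulling the failing paths out of the task logs. This is only a rough sketch: it assumes the log text contains the same "Could not obtain block: ... file=..." message as in the traces below, and that segment directories are 14-digit timestamps under segments/. The function name affected_segments is my own. It does not check the segments themselves; it only reports which ones already failed.

```python
import re

# Matches the HDFS client error seen in the traces. The block id and the
# file= path may be separated by a line break in wrapped logs, hence \s+.
BLOCK_ERROR = re.compile(
    r"Could not obtain block:\s*(?P<block>blk_-?\d+_\d+)\s+file=(?P<path>\S+)"
)

# Assumed layout: .../segments/<14-digit timestamp>/<segment part>/...
SEGMENT = re.compile(r"/segments/(?P<segment>\d{14})/(?P<part>[^/]+)")


def affected_segments(log_text):
    """Map each failing segment to a sorted list of (part, block id) pairs."""
    found = {}
    for match in BLOCK_ERROR.finditer(log_text):
        seg = SEGMENT.search(match.group("path"))
        if seg is None:
            continue  # the unreadable file is not inside a segment directory
        found.setdefault(seg.group("segment"), set()).add(
            (seg.group("part"), match.group("block"))
        )
    return {name: sorted(parts) for name, parts in found.items()}


if __name__ == "__main__":
    import sys

    # Usage: python affected_segments.py < tasklog.txt
    for name, parts in sorted(affected_segments(sys.stdin.read()).items()):
        for part, block in parts:
            print(name, part, block)
```

Each segment it prints is a candidate for re-fetching.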
Regards, and thanks in advance :)
Mischa
<!--
java.io.IOException: Could not obtain block: blk_8431627671702898365_95075 file=/user/nutch/crawl/segments/20091012145602/crawl_generate/part-00000
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:166)
at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:161)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
...
java.io.IOException: Could not obtain block: blk_7970643458650610887_21674 file=/user/nutch/crawl/segments/20090618111426/content/part-00003/data
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1707)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1535)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1662)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat.getRecordReader(SegmentMerger.java:150)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
-->
On 26 Nov 2009, at 12:03, Santiago Pérez wrote:
>
> Hej,
>
> I am a newbie in Nutch and I need some help with a problem because I cannot
> find clear documentation.
>
> In the crawling process, when each FetcherThread gets the content, it is
> formatted in a way that deletes the newline characters ("\n") and
> transforms useful Spanish characters such as á, é, í, ó, ú, ñ, ü in the default
> encoding like: �¡, �³, � , �³, �º, �±, �¼.
>
> I would like to know if it is possible to set this default encoding (is it
> UTF-8?) to the one that I need (ASCII, I guess).
>
> Thanks in advance ;)
> --
> View this message in context:
> http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
___________________________________
Mischa Tuffield
Email: [email protected]
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD