-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------


./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91927>

    This should read FileDumper <output directory> <segments dir>


./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91928>

    If I invoke this tool without ANY arguments, I get the following:

    lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper
    2014-09-09 15:57:19.045 java[3866:1903] Unable to load realm info from SCDynamicStore
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at org.apache.nutch.tools.FileDumper.main(FileDumper.java:53)


./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91929>

    When I invoke this tool as follows:

    lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
    2014-09-09 15:59:06.185 java[3883:1903] Unable to load realm info from SCDynamicStore
    Sep 09, 2014 3:59:06 PM org.apache.nutch.tools.FileDumper main
    INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
    Exception in thread "main" java.io.IOException: wrong key class: org.apache.hadoop.io.Text is not class org.apache.hadoop.io.UTF8
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1886)
        at org.apache.nutch.tools.FileDumper.main(FileDumper.java:99)


./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91931>

    When I change the Text() class to use the UTF8() class, I get the following:

    lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
    2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
    Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
    INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
    Exception in thread "main" java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
        at org.apache.nutch.protocol.Content.readFields(Content.java:154)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
        at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)

    UTF8 is of course deprecated now, so we need to stick with Text and implement the correct code.


- Lewis McGibbney


On Sept. 6, 2014, 4:57 a.m., Chris Mattmann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9119/
> -----------------------------------------------------------
> 
> (Updated Sept. 6, 2014, 4:57 a.m.)
> 
> 
> Review request for nutch and Julien Le Dem.
> 
> 
> Bugs: NUTCH-1526
>     https://issues.apache.org/jira/browse/NUTCH-1526
> 
> 
> Repository: nutch
> 
> 
> Description
> -------
> 
> Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:
> 
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>    -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
>    -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB in each segment
>    -outputDir        The output directory to write file names to.
>    -metadata --key=value  where key is a Content Metadata key and value is a value to check.
> 
> 
> Diffs
> -----
> 
>   ./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/9119/diff/
> 
> 
> Testing
> -------
> 
> Testing it on DARPA XDATA XNET.
> 
> 
> Thanks,
> 
> Chris Mattmann
> 
>
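
The ArrayIndexOutOfBoundsException reported in the review for a no-argument invocation suggests that main reads args[0] before checking how many arguments were passed. A minimal sketch of a guard follows, assuming the two-argument form FileDumper <output directory> <segments dir> proposed in the first comment. The class FileDumperArgs, the method usageError, and the "Usage: " prefix are hypothetical names chosen for illustration; none of them come from the patch under review.

```java
/** Hypothetical helper for illustration; not part of the FileDumper patch. */
public class FileDumperArgs {

  // Usage text following the form proposed in the review comment
  // (the "Usage: " prefix is an assumption).
  public static final String USAGE =
      "Usage: FileDumper <output directory> <segments dir>";

  /**
   * Returns the usage message when the argument list is unusable,
   * or null when exactly two arguments were supplied.
   */
  public static String usageError(String[] args) {
    if (args == null || args.length != 2) {
      return USAGE;
    }
    return null;
  }

  public static void main(String[] args) {
    // Check the argument count before touching args[0] or args[1].
    String error = usageError(args);
    if (error != null) {
      System.err.println(error);
      System.exit(1);
    }
    String outputDir = args[0];
    String segmentsDir = args[1];
    System.out.println("output=" + outputDir + " segments=" + segmentsDir);
  }
}
```

With a check of this kind, running the tool with no arguments would print the usage line and exit with a non-zero status instead of throwing.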

