[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL
[ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327882#comment-16327882 ]

ASF GitHub Bot commented on NUTCH-2370:
---------------------------------------

chrismattmann commented on issue #180: fix for NUTCH-2370 contributed by msha...@usc.edu
URL: https://github.com/apache/nutch/pull/180#issuecomment-358121493

   awesome!

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> FileDumper: save JSON mapping file -> URL
> -----------------------------------------
>
>                 Key: NUTCH-2370
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2370
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dumpers
>    Affects Versions: 1.14
>            Reporter: Madhav Sharan
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.14
>
>
> - `nutch dump` [0] is a convenient tool for dumping all the crawled files
> from Nutch segments.
> - After a dump, we lose the information about which URL each file was
> crawled from. The URL is used to name the dumped file, but only as a hash
> (MD5), so it cannot be read back from the file name.
> - With the `reverseUrlDirs` option, one can work out the URL from the file
> path, but doing so is more complicated than consulting a simple mapping file.
> - With `flatdir`, there is no way to recover the actual URL.
> I am submitting a PR that edits [0] to save one JSON file per crawled
> segment, mapping each dumped file path to its URL.
> [0]
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
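To illustrate the problem described in the issue: a file name derived from a hash of the URL cannot be turned back into the URL, so the association has to be recorded at dump time. The following is a minimal, dependency-free sketch of that idea. The class `HashedNameSketch`, the method `flatName`, and the `<md5>_<baseName>.<extension>` pattern are assumptions made for the example; they are not taken from `DumpFileUtil`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: hypothetical names, not the FileDumper/DumpFileUtil code. */
class HashedNameSketch {

  /** Hex-encoded MD5 digest of a string. A digest is one-way: the input cannot be recovered. */
  static String md5Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // Not expected on a standard JDK; treated as a fatal configuration error.
      throw new IllegalStateException(e);
    }
  }

  /** Assumed flat naming pattern: <md5 of URL>_<baseName>.<extension>. */
  static String flatName(String url, String baseName, String extension) {
    return md5Hex(url) + "_" + baseName + "." + extension;
  }

  public static void main(String[] args) {
    // Record the file name -> URL association while the URL is still known.
    Map<String, String> filenameToUrl = new LinkedHashMap<>();
    String[] urls = {"http://example.com/a/index.html", "http://example.com/b/index.html"};
    for (String url : urls) {
      filenameToUrl.put(flatName(url, "index", "html"), url);
    }
    filenameToUrl.forEach((file, url) -> System.out.println(file + " -> " + url));
  }
}
```

Both example URLs share the base name `index.html`, so only the hash prefix distinguishes the two dumped files, and only the recorded map tells which file came from which URL.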
[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL
[ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295241#comment-16295241 ]

Hudson commented on NUTCH-2370:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See [https://builds.apache.org/job/Nutch-trunk/3487/])
fix for NUTCH-2370 contributed by msha...@usc.edu (snagel: [https://github.com/apache/nutch/commit/34236ffecf478a1776559b0ed8c1ad929483d752])
* (edit) src/java/org/apache/nutch/tools/FileDumper.java
[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL
[ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294208#comment-16294208 ]

Hudson commented on NUTCH-2370:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3486 (See [https://builds.apache.org/job/Nutch-trunk/3486/])
fix for NUTCH-2370 contributed by msha...@usc.edu (goyal.madhav: [https://github.com/apache/nutch/commit/fd6f20e6dfc9a4a7bbad3478e6af4469d9449cca])
* (edit) src/java/org/apache/nutch/tools/FileDumper.java
[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL
[ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294183#comment-16294183 ]

ASF GitHub Bot commented on NUTCH-2370:
---------------------------------------

sebastian-nagel commented on issue #180: fix for NUTCH-2370 contributed by msha...@usc.edu
URL: https://github.com/apache/nutch/pull/180#issuecomment-352260567

   +1 LGTM!

   Thanks, @smadha!
[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL
[ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294184#comment-16294184 ]

ASF GitHub Bot commented on NUTCH-2370:
---------------------------------------

sebastian-nagel closed pull request #180: fix for NUTCH-2370 contributed by msha...@usc.edu
URL: https://github.com/apache/nutch/pull/180

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/tools/FileDumper.java b/src/java/org/apache/nutch/tools/FileDumper.java
index e8b0f46e8..31218bbb0 100644
--- a/src/java/org/apache/nutch/tools/FileDumper.java
+++ b/src/java/org/apache/nutch/tools/FileDumper.java
@@ -57,6 +57,7 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import org.codehaus.jackson.map.ObjectMapper;
 /**
  * 
  * The file dumper tool enables one to reverse generate the raw content from
@@ -158,6 +159,7 @@ public void dump(File outputDir, File segmentRootDir, String[] mimeTypes, boolea
     for (File segment : segmentDirs) {
       LOG.info("Processing segment: [" + segment.getAbsolutePath() + "]");
       DataOutputStream doutputStream = null;
+      Map filenameToUrl = new HashMap();
 
       File segmentDir = new File(segment.getAbsolutePath(), Content.DIR_NAME);
       File[] partDirs = segmentDir.listFiles(file -> file.canRead() && file.isDirectory());
@@ -247,7 +249,7 @@ public void dump(File outputDir, File segmentRootDir, String[] mimeTypes, boolea
               } else {
                 outputFullPath = String.format("%s/%s", fullDir, DumpFileUtil.createFileName(md5Ofurl, baseName, extension));
               }
-
+              filenameToUrl.put(outputFullPath, url);
               File outputFile = new File(outputFullPath);
 
               if (!outputFile.exists()) {
@@ -289,6 +291,10 @@ public void dump(File outputDir, File segmentRootDir, String[] mimeTypes, boolea
           }
         }
       }
+      //save filenameToUrl in a json file for each segment there is one mapping file
+      String filenameToUrlFilePath = String.format("%s/%s_filenameToUrl.json", outputDir.getAbsolutePath(), segment.getName() );
+      new ObjectMapper().writeValue(new File(filenameToUrlFilePath), filenameToUrl);
+
     }
     LOG.info("Dumper File Stats: " + DumpFileUtil.displayFileTypes(typeCounts, filteredCounts));
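For readers who want to try the per-segment mapping step outside Nutch, here is a minimal sketch that mirrors the file name used in the patch (`<outputDir>/<segmentName>_filenameToUrl.json`). It is not the committed code: the class `MappingFileSketch` and its methods are hypothetical, the map is typed as `Map<String, String>`, and a small hand-written serializer stands in for Jackson's `ObjectMapper` so that the example has no dependencies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustration only: a dependency-free stand-in for the ObjectMapper call in the patch. */
class MappingFileSketch {

  /** Escapes a string for use inside a JSON string literal. */
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n");  break;
        case '\r': sb.append("\\r");  break;
        case '\t': sb.append("\\t");  break;
        default:
          if (c < 0x20) {
            // Remaining control characters must be written as \ u escapes.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  /** Serializes the map as one flat JSON object: {"<file path>":"<url>",...}. */
  static String toJson(Map<String, String> filenameToUrl) {
    StringJoiner joiner = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> e : filenameToUrl.entrySet()) {
      joiner.add("\"" + escape(e.getKey()) + "\":\"" + escape(e.getValue()) + "\"");
    }
    return joiner.toString();
  }

  /** Writes <outputDir>/<segmentName>_filenameToUrl.json and returns its path. */
  static Path writeMapping(Path outputDir, String segmentName, Map<String, String> filenameToUrl) {
    Path target = outputDir.resolve(segmentName + "_filenameToUrl.json");
    try {
      return Files.write(target, toJson(filenameToUrl).getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical segment name and dumped file path, for demonstration only.
    Map<String, String> filenameToUrl = new LinkedHashMap<>();
    filenameToUrl.put("/tmp/dump/0123abcd_index.html", "http://example.com/index.html");
    Path dir = Files.createTempDirectory("dump-sketch");
    System.out.println("wrote " + writeMapping(dir, "20171215000000", filenameToUrl));
  }
}
```

Because one map is built and written per segment, a consumer can load a single small JSON file per segment and look up the URL by dumped file path, which is the use case the issue describes for `flatdir` dumps.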