[ https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001953#comment-17001953 ]

ASF GitHub Bot commented on NUTCH-1863:
---------------------------------------

sebastian-nagel commented on pull request #490: Fix for NUTCH-1863: Add JSON format dump output to readdb command
URL: https://github.com/apache/nutch/pull/490#discussion_r360712503
 
 

 ##########
 File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
 ##########
 @@ -185,13 +190,83 @@ public synchronized void write(Text key, CrawlDatum value)
         out.writeByte('\n');
       }
 
 -      public synchronized void close(TaskAttemptContext context) throws IOException {
+      public synchronized void close(TaskAttemptContext context)
+          throws IOException {
+        out.close();
+      }
+    }
+
+    public RecordWriter<Text, CrawlDatum> getRecordWriter(
+        TaskAttemptContext context) throws IOException {
+      String name = getUniqueFile(context, "part", "");
+      Path dir = FileOutputFormat.getOutputPath(context);
+      FileSystem fs = dir.getFileSystem(context.getConfiguration());
+      DataOutputStream fileOut = fs.create(new Path(dir, name), context);
+      return new LineRecordWriter(fileOut);
+    }
+  }
+
+  public static class CrawlDatumJsonOutputFormat
+      extends FileOutputFormat<Text, CrawlDatum> {
+    protected static class LineRecordWriter
+        extends RecordWriter<Text, CrawlDatum> {
+      private DataOutputStream out;
+      private ArrayList<String> jsonString = new ArrayList<String>();
 
 Review comment:
   Nutch is designed to be scalable and should be able to handle CrawlDbs containing billions of items. Even if the CrawlDb is split into many partitions, every task may still have to handle 100 million CrawlDb items. It's hardly possible to keep such a huge list in memory, or at least it would be too expensive a use of resources. So it's mandatory to write each record directly to the output; Hadoop itself handles large outputs efficiently by buffering parts in memory and spilling to disk when the buffers are full.
   
   To simplify the process: why not use the [JSON lines format](http://jsonlines.org/)? One JSON-formatted record per line: `{"url": "...", ...}`
   - it makes writing the output much easier because you do not need to preserve state internally to properly open and close a list of records.
   - JSON lines has several other advantages:
     - it's possible to grep for records
     - [jq](https://stedolan.github.io/jq/) can process it as well
     - and again: any JSON parser needs to keep only a single line in memory (see the reader sketch below)
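   
   As a consumer-side sketch of that last point (the dump file path and field names are assumptions): a plain line-by-line reader that parses one record at a time, so only a single line is ever held in memory.
   
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadCrawlDbJsonLines {
  public static void main(String[] args) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    // Read and parse one JSON record per line; memory usage is
    // independent of the total size of the dump.
    try (BufferedReader reader =
        Files.newBufferedReader(Paths.get("crawldb-dump/part-r-00000"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        JsonNode record = mapper.readTree(line);
        System.out.println(record.get("url").asText());
      }
    }
  }
}
```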
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Add JSON format dump output to readdb command
> ---------------------------------------------
>
>                 Key: NUTCH-1863
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1863
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>    Affects Versions: 2.3, 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Shashanka Balakuntala Srinivasa
>            Priority: Major
>             Fix For: 1.17
>
>
> Opening up the ability for third parties to consume Nutch crawldb data as 
> JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of 1.X to enable JSON 
> dumps of crawldb data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
