Thamme Gowda N created NUTCH-2250:
-------------------------------------
Summary: CommonCrawlDumper : Invalid format + skipped parts
Key: NUTCH-2250
URL: https://issues.apache.org/jira/browse/NUTCH-2250
Project: Nutch
Issue Type: Sub-task
Components: commoncrawl
Affects Versions: 1.12
Environment: Linux x64
Java 7
Nutch 1.12
Reporter: Thamme Gowda N
The following issues were found with CommonCrawlDumper:
1. Documents get duplicated in dump files
How to reproduce:
{code}
bin/nutch commoncrawldump -segment .../segments -outputDir testdump
-SimpleDateFormat -epochFilename -jsonArray -reverseKey
{code}
The first file written contains one document.
The second file contains two documents.
The third file contains the first three documents, and the count keeps growing linearly with each file.
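The growth pattern above (file k holding the first k documents) is what one would see if a shared buffer were appended to for every document but never cleared between writes. The sketch below only illustrates that pattern. The names `DuplicationSketch` and `dumpSizes` are hypothetical, this is not the actual Nutch dumper code, and the real cause is not confirmed by this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DuplicationSketch {
    // Returns how many documents each output file would contain when one
    // file is written per document.
    // clearBetweenFiles=false models the reported symptom (assumed cause).
    static List<Integer> dumpSizes(List<String> docs, boolean clearBetweenFiles) {
        List<String> buffer = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        for (String doc : docs) {
            if (clearBetweenFiles) {
                buffer.clear(); // start each file with an empty buffer
            }
            buffer.add(doc);
            sizes.add(buffer.size()); // size of the file "written" at this step
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c");
        System.out.println(dumpSizes(docs, false)); // [1, 2, 3]
        System.out.println(dumpSizes(docs, true));  // [1, 1, 1]
    }
}
```

With the buffer left uncleared, n documents produce n(n+1)/2 written records in total instead of n.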
2. If a segment has multiple parts (part-00000, part-00001, ...), only the first part
(part-00000) is dumped
How to reproduce:
Create a segment with two parts (part-00000 and part-00001), then run commoncrawldump on it.
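To check how many parts a segment actually has before dumping it, the part directories can be enumerated. This is a minimal sketch, assuming the segment sits on the local filesystem and that the parts live under a `content` subdirectory. The class name `SegmentParts` and the temporary directory are made up for the example; Nutch itself goes through the Hadoop filesystem API rather than `java.nio`.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SegmentParts {
    // Lists every entry matching "part-*" in the given directory, sorted by name.
    static List<String> listParts(Path contentDir) throws IOException {
        List<String> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(contentDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p.getFileName().toString());
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Simulated segment layout (assumption): <segment>/content/part-0000N
        Path content = Files.createTempDirectory("segment").resolve("content");
        Files.createDirectories(content.resolve("part-00000"));
        Files.createDirectories(content.resolve("part-00001"));
        System.out.println(listParts(content)); // [part-00000, part-00001]
    }
}
```

A dump that covers the whole segment would be expected to contain documents from every part listed, not only part-00000.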
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)