[jira] [Updated] (NUTCH-1975) New configuration for CommonCrawlDataDumper tool

Giuseppe Totaro (JIRA) Thu, 02 Apr 2015 12:15:05 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Giuseppe Totaro updated NUTCH-1975:
-----------------------------------
    Attachment: NUTCH-1975.v03.patch

Patch v03 adds support for filenames that are too long. In more detail, the 
file extension is truncated if it is longer than {{MAX_LENGTH_OF_EXTENSION}}, 
as is done in other methods of {{DumpFileUtil}}. 
Note that the file extension refers to the text after the last dot in the URL 
string. This part can be either the actual extension of the file or other text 
(e.g., text after the last dot in the query part, if any). However, the SHA1 
digest is calculated against the original (not truncated) filename.
Thanks [~chrismattmann] for testing this new configuration.
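
The truncation rule described above can be sketched roughly as follows. This is 
only an illustration, not the code from the patch: the class and method names 
are hypothetical, and the value 5 for {{MAX_LENGTH_OF_EXTENSION}} is an assumed 
placeholder (the real constant is defined in {{DumpFileUtil}}).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not the DumpFileUtil code from the patch.
final class ExtensionTruncationSketch {
    // Assumed limit; the real constant is defined in DumpFileUtil and may differ.
    static final int MAX_LENGTH_OF_EXTENSION = 5;

    // Returns the text after the last dot in the URL string ("" if there is no dot).
    static String rawExtension(String url) {
        int dot = url.lastIndexOf('.');
        return dot < 0 ? "" : url.substring(dot + 1);
    }

    // Cuts the extension down to at most MAX_LENGTH_OF_EXTENSION characters.
    static String truncatedExtension(String url) {
        String ext = rawExtension(url);
        return ext.length() > MAX_LENGTH_OF_EXTENSION
            ? ext.substring(0, MAX_LENGTH_OF_EXTENSION)
            : ext;
    }

    // Hex-encoded SHA1 digest of the given string (UTF-8 bytes).
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/page.php?file=report.backupcopy";
        // The digest is computed on the original string; only the extension is shortened.
        System.out.println(sha1Hex(url) + "." + truncatedExtension(url));
    }
}
```

In this sketch the text after the last dot of the example URL comes from the 
query part ("backupcopy"), which is the case mentioned above where the 
"extension" is not a real file extension.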

> New configuration for CommonCrawlDataDumper tool
> ------------------------------------------------
>
>                 Key: NUTCH-1975
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1975
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>    Affects Versions: 1.9
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-1975.patch, NUTCH-1975.v02.patch, 
> NUTCH-1975.v03.patch
>
>
> Hi all, attached is a new patch that adds support for new 
> options for {{CommonCrawlDataDumper}}.
> In particular, the new options are passed to the {{CommonCrawlFormat}} 
> object (which provides methods to create JSON output) using a configuration 
> object ({{CommonCrawlConfig}}).
> In this patch, {{CommonCrawlDataDumper}} provides support for 
> the following options:
> * {{-SimpleDataFormat}}: enables timestamps in GMT epoch (milliseconds) 
> format.
> * {{-epochFilename}}: extracted files will be organized in a reversed-DNS 
> tree based on the FQDN of the webpage, followed by a SHA1 hash of the 
> complete URL. Scraped data will be stored in these directories as individual 
> GMT-timestamped files named with the epoch time (in milliseconds) plus the 
> file extension.
> * {{-jsonArray}}: organizes both request and response headers into a JSON 
> array instead of using a JSON sub-object.
> * {{-reverseKey}}: uses the same layout as described for the 
> {{-epochFilename}} option, with underscores in place of directory separators.
> You can use the options above in addition to the options already supported, 
> as described in the [Nutch 
> wiki|https://wiki.apache.org/nutch/CommonCrawlDataDumper] page.
> This patch starts from 
> [NUTCH-1974|https://issues.apache.org/jira/browse/NUTCH-1974].
> Thanks [~chrismattmann] and [~annieburgess] for supporting me on this work.
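
For illustration, the layout described for {{-epochFilename}} and 
{{-reverseKey}} could look roughly like the Java sketch below. It is not taken 
from the patch: the class and method names are hypothetical, and it assumes 
that {{-reverseKey}} also joins the timestamped file name with an underscore, 
which the description does not state explicitly.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the reversed-DNS layout; names are not from the patch.
final class ReversedDnsLayoutSketch {
    // Reverses the labels of a host name: "www.example.com" -> "com/example/www".
    static String reverseHost(String host, String separator) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append(separator);
            }
        }
        return sb.toString();
    }

    // Hex-encoded SHA1 digest of the complete URL (UTF-8 bytes).
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Builds: reversed host, SHA1 of the URL, then "<epoch millis>.<extension>".
    // separator "/" mimics -epochFilename; "_" mimics -reverseKey (assumed).
    static String outputPath(String url, long epochMillis, String extension,
                             String separator) {
        String host = URI.create(url).getHost();
        return reverseHost(host, separator) + separator + sha1Hex(url)
            + separator + epochMillis + "." + extension;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/a.html";
        long now = System.currentTimeMillis();
        System.out.println(outputPath(url, now, "html", "/"));
        System.out.println(outputPath(url, now, "html", "_"));
    }
}
```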



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
