[jira] [Comment Edited] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-06-08 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359353#comment-17359353
 ] 

Markus Jelsma edited comment on NUTCH-2865 at 6/8/21, 1:28 PM:
---

* separated the parseText and parseData
 * changed the response flag
 * contentMeta and parseMeta are now added as valid JSON using GSON
 * Refers-To field added

It's not very pretty code, and it contains some partially duplicated pieces, 
but it works. With GSON in place, though, we can choose to remove it as a 
dependency of some of our plugins.


> WARC exporter support for metadata and dropping empty responses
> ---
>
> Key: NUTCH-2865
> URL: https://issues.apache.org/jira/browse/NUTCH-2865
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.19
>
> Attachments: NUTCH-2865.patch, NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also 
> emits WARC records for statuses other than success or notmodified, which 
> account for a decent number in each crawl cycle. It also doesn't emit parsed 
> metadata or extracted text. It does now.
>  
> This patch adds three switches:
>  * -includeOnlySuccessfulResponses to only emit records of success or 
> notmodified
>  * -includeParseData to also emit parse metadata as WARC metadata record
>  * -includeParseText to also emit extracted text as WARC metadata
> Both metadata objects are stored in the same WARC metadata record to save 
> space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (NUTCH-2865) WARC exporter support for metadata and dropping empty responses

2021-05-31 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354530#comment-17354530
 ] 

Sebastian Nagel edited comment on NUTCH-2865 at 5/31/21, 3:57 PM:
--

Hi [~markus17], there are still print statements in the second patch. Yes, good 
idea. Some comments: 

bq. Both metadata objects are stored in the same WARC metadata record to save 
space.

To save space it would be better to compress the WARC files, but doing this 
properly (per record) would require using another WARC writing library (e.g. 
[jwarc|https://github.com/iipc/jwarc]).
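
As a rough illustration of what "per record" compression means, here is a minimal sketch that uses only java.util.zip (it is not the jwarc API). Each record becomes its own gzip member, so a reader that knows a record's offset and length can decompress that record alone. The class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class PerRecordGzip {

  /** Compresses one WARC record as its own, complete gzip member. */
  static byte[] gzipMember(String record) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
      gz.write(record.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Decompresses the single gzip member stored at [offset, offset + length). */
  static String gunzipMember(byte[] file, int offset, int length) {
    try (GZIPInputStream gz =
        new GZIPInputStream(new ByteArrayInputStream(file, offset, length))) {
      return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Concatenating the members in record order gives the compressed file; the offsets and lengths are what an index would need to store for random access.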

Having everything in a single record makes it hard to read back:
{noformat}
parseText=text starts
continues
continues
ends

parseData=key=value key=value ...
{noformat}

Why not put text and parse data into two records with well-defined and suitable 
content types?

- for the text:
{noformat}
WARC/1.0
WARC-Type: conversion
WARC-Date: ...
WARC-Refers-To: 
Content-Type: text/plain
Content-Length: ...

text starts
continues
continues
ends
{noformat}

- parse metadata (btw. what about content metadata?):
{noformat}
WARC/1.0
WARC-Type: metadata
WARC-Date: ...
WARC-Record-ID: ...
WARC-Refers-To: 
Content-Type: application/json
Content-Length: ...

{"key": "value", ...}
{noformat}

Ideally, the conversion and metadata records are linked via UUID with the 
response records.
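
A minimal sketch of how the two records could be assembled and linked, assuming plain string building rather than the WARC library the exporter actually uses. The class name, the method names and the "<urn:uuid:...>" ID form are assumptions for illustration; Content-Length is the byte length of the block, and each record ends with two CRLFs.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, for illustration only.
final class LinkedWarcRecords {

  /** Creates a record ID in the "<urn:uuid:...>" form (an assumption here). */
  static String newRecordId() {
    return "<urn:uuid:" + UUID.randomUUID() + ">";
  }

  /** Builds one record whose WARC-Refers-To points at the response record. */
  static String record(String warcType, String date, String refersTo,
      String contentType, String body) {
    // Content-Length counts bytes of the block, not characters.
    int length = body.getBytes(StandardCharsets.UTF_8).length;
    return "WARC/1.0\r\n"
        + "WARC-Type: " + warcType + "\r\n"
        + "WARC-Date: " + date + "\r\n"
        + "WARC-Record-ID: " + newRecordId() + "\r\n"
        + "WARC-Refers-To: " + refersTo + "\r\n"
        + "Content-Type: " + contentType + "\r\n"
        + "Content-Length: " + length + "\r\n"
        + "\r\n"
        + body
        + "\r\n\r\n";
  }
}
```

The response record's ID would be passed as refersTo for both the conversion record (text/plain) and the metadata record (application/json).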

bq. -omitEmptyResponses

Sometimes (if not frequently) 404s and redirects are not empty but include a 
payload (a customized error page). Maybe "-includeOnlySuccessfulResponses"?
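
To make the intended semantics of such a switch concrete, here is a small sketch. The FetchStatus enum is a hypothetical stand-in and not Nutch's ProtocolStatus; the method name is made up as well.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class SuccessfulResponseFilter {

  /** Stand-in for the fetch status; not Nutch's ProtocolStatus. */
  enum FetchStatus { SUCCESS, NOTMODIFIED, NOTFOUND, MOVED, GONE, EXCEPTION }

  private static final Set<FetchStatus> SUCCESSFUL =
      EnumSet.of(FetchStatus.SUCCESS, FetchStatus.NOTMODIFIED);

  /** Returns true if a record should be written for the given status. */
  static boolean shouldEmit(FetchStatus status, boolean includeOnlySuccessfulResponses) {
    // Without the switch every status is emitted, as before.
    return !includeOnlySuccessfulResponses || SUCCESSFUL.contains(status);
  }
}
```

With the switch set, a 404 with a customized error page is dropped because of its status, not because its payload is empty, which is why the name fits better than "-omitEmptyResponses".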

bq. notmodified

You mean not-modified detected via signature comparison? HTTP 304 responses 
(ProtocolStatus.NOTMODIFIED) definitely have no payload (response body). See also 
[WARC revisit records|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit] 
([revisit example|https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record]).



> WARC exporter support for metadata and dropping empty responses
> ---
>
> Key: NUTCH-2865
> URL: https://issues.apache.org/jira/browse/NUTCH-2865
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.19
>
> Attachments: NUTCH-2865.patch, NUTCH-2865.patch
>
>
> WARCExporter is a handy tool to dump the segments. Unfortunately it also 
> emits WARC records for statuses other than success or notmodified, which 
> account for a decent number in each crawl cycle. It also doesn't emit parsed 
> metadata or extracted text. It does now.
>  
> This patch adds three switches:
>  * -omitEmptyResponses to only emit records of success or notmodified
>  * -includeParseData to also emit parse metadata as WARC metadata record
>  * -includeParseText to also emit extracted text as WARC metadata
> Both metadata objects are stored in the same WARC metadata record to save 
> space.


