Re: [jira] [Updated] (NUTCH-2042) parse-html increase chunk size used to detect charset

Mattmann, Chris A (3980) Thu, 23 Jul 2015 13:50:54 -0700

+1

Sent from my iPhone


> On Jul 23, 2015, at 1:47 PM, Sebastian Nagel (JIRA) <[email protected]> wrote:
> 
> 
>     [ https://issues.apache.org/jira/browse/NUTCH-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Sebastian Nagel updated NUTCH-2042:
> -----------------------------------
>    Attachment: NUTCH-2042-trunk-v2.patch
> 
> Updated patch for trunk (trivial change). Objections to commit?
> 
>> parse-html increase chunk size used to detect charset
>> -----------------------------------------------------
>> 
>>                Key: NUTCH-2042
>>                URL: https://issues.apache.org/jira/browse/NUTCH-2042
>>            Project: Nutch
>>         Issue Type: Bug
>>         Components: parser
>>   Affects Versions: 2.3, 1.10
>>           Reporter: Sebastian Nagel
>>           Priority: Minor
>>            Fix For: 2.4, 1.11
>> 
>>        Attachments: NUTCH-2042-2x-v1.patch, NUTCH-2042-trunk-v1.patch, 
>> NUTCH-2042-trunk-v2.patch
>> 
>> 
>> The chunk used to detect the encoding of a document is set to 2000 bytes. 
>> Although it is best practice to declare the character set at the top of 
>> the document, 2000 bytes are sometimes not enough: 20 long <link> elements 
>> pointing to JavaScript and CSS libraries may "hide" the <meta> element that 
>> specifies the content type and encoding. The same problem was observed in 
>> TIKA-357 and solved by increasing the buffer size to 8 kB.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
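
For anyone skimming the thread, here is a rough sketch of the effect described in the issue. It is not the parse-html code and not the patch: the regular expression, the function names `sniff_charset` and `build_page`, and the example.com URLs are all made up for illustration. It only assumes that the detector looks for a <meta> charset declaration inside the first N bytes of the page.

```python
import re

# Hypothetical pattern (an assumption, not the one used by parse-html):
# finds a charset declaration inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([a-zA-Z0-9_\-]+)', re.IGNORECASE
)


def sniff_charset(content: bytes, chunk_size: int):
    """Return the charset declared within the first chunk_size bytes, or None."""
    match = META_CHARSET.search(content[:chunk_size])
    return match.group(1).decode("ascii").lower() if match else None


def build_page(num_links: int) -> bytes:
    """Build a test page whose <meta> element follows num_links <link> elements."""
    links = "".join(
        '<link rel="stylesheet" type="text/css" '
        f'href="https://cdn.example.com/assets/libs/library-{i:02d}/style.min.css">\n'
        for i in range(num_links)
    )
    html = (
        "<html><head>\n"
        + links
        + '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">\n'
        + "</head><body></body></html>"
    )
    return html.encode("ascii")


page = build_page(20)
small = sniff_charset(page, 2000)   # chunk ends before the <meta> element
large = sniff_charset(page, 8192)   # 8 kB chunk reaches the <meta> element

print("2000-byte chunk:", small)
print("8192-byte chunk:", large)
```

With 20 such <link> elements the <meta> element starts past byte 2000, so the small chunk finds no declaration while the 8 kB chunk does.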
