+1

Sent from my iPhone
> On Jul 23, 2015, at 1:47 PM, Sebastian Nagel (JIRA) <[email protected]> wrote:
>
>
>     [ https://issues.apache.org/jira/browse/NUTCH-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Sebastian Nagel updated NUTCH-2042:
> -----------------------------------
>     Attachment: NUTCH-2042-trunk-v2.patch
>
> Updated patch for trunk (trivial change). Objections to commit?
>
>> parse-html increase chunk size used to detect charset
>> -----------------------------------------------------
>>
>>                 Key: NUTCH-2042
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-2042
>>             Project: Nutch
>>          Issue Type: Bug
>>          Components: parser
>>    Affects Versions: 2.3, 1.10
>>            Reporter: Sebastian Nagel
>>            Priority: Minor
>>             Fix For: 2.4, 1.11
>>
>>         Attachments: NUTCH-2042-2x-v1.patch, NUTCH-2042-trunk-v1.patch,
>>                      NUTCH-2042-trunk-v2.patch
>>
>>
>> The chunk used to detect the encoding of a document is set to 2000 bytes.
>> Although it is definitely best practice to "define" the character set on
>> top, 2000 bytes are sometimes not enough: 20 longer <link> elements pointing
>> to javascript and css libs may "hide" the <meta> element containing content
>> type and encoding. Same problem has been observed in TIKA-357 and solved by
>> increasing the buffer size to 8 kB.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
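For anyone skimming the thread, the effect described in the issue can be reproduced with a small standalone sketch. This is not the Nutch or Tika code: the function name sniff_meta_charset, the regular expression, and the example.com URLs are all made up for illustration, and 8192 is assumed as the byte value of the "8 kB" mentioned above. It only shows how a charset declaration that sits behind 20 long <link> elements falls outside a 2000-byte chunk but inside a larger one.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration only: finds a charset declared in a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_meta_charset(content: bytes, chunk_size: int) -> Optional[str]:
    """Look for a <meta> charset declaration in the first chunk_size bytes only."""
    chunk = content[:chunk_size]
    match = META_CHARSET.search(chunk)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def build_page() -> bytes:
    """Build a page whose <meta> element comes after 20 long <link> elements."""
    links = b"".join(
        b'<link rel="stylesheet" href="https://example.com/static/libs/'
        b"some-long-library-name-%02d/dist/css/library.min.css?v=1234567890\">\n" % i
        for i in range(20)
    )
    meta = b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">\n'
    return b"<html><head>\n" + links + meta + b"</head><body>text</body></html>"


page = build_page()

print(page.index(b"<meta"))            # offset of the <meta> element, beyond 2000
print(sniff_meta_charset(page, 2000))  # None: the declaration is outside the chunk
print(sniff_meta_charset(page, 8192))  # windows-1252: found with the larger chunk
```

With the small chunk the sketch finds nothing, so a detector would have to fall back to a guess or a default; with the larger chunk the declared charset is found.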

