[ https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-3690:
--------------------------------
Attachment: HTMLStripCharFilterWarcTest.java
JFlexHTMLStripCharFilterWarcTest.java
BaselineWarcTest.java
LUCENE-3690.patch
bq. some test docs using typical wellformed html markup
I have access to [ClueWeb09|http://lemurproject.org/clueweb09.php]. For
performance testing I used the first WARC file for the English and Chinese
languages ({{en0000/00.warc.gz}} and {{zh0000/00.warc.gz}}), each of which when
uncompressed contains about 1GB of text (including a small amount of non-HTML
metadata: WARC information and HTTP headers). The English WARC contains about
35,000 documents from about 2,100 unique domains. The Chinese WARC contains
about 33,000 documents from about 550 unique domains.
I compared {{JFlexHTMLStripCharFilter}}'s output with that of
{{HTMLStripCharFilter}} for several hundred documents. In the course of this
comparison, I found several problems with the JFlex implementation: no
{{<STYLE>}} tag handling; no MS conditional tag handling, e.g.
{{<!\[if ! IE]>}}; and some problems handling creative attribute values. The
attached patch fixes these. I re-ran the text-only and malformed HTML performance
tests on the final implementation, and the numbers aren't significantly
different from those prior to these fixes. The new patch also contains the
more-evil {{_TestUtils.randomHtmlishString()}}; shifts the {{CharFilter}}
javadocs from {{BaseCharFilter.addOffCorrectMapping()}} to {{package.html}};
and adds several more tests to {{JFlexHTMLStripCharFilterTest.java}}.
I have attached the three classes I used to test performance over the ClueWeb09
subset. {{BaselineWarcTest.java}} uses [the WarcRecord class supplied with the
ClueWeb09
collection|http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Working+with+WARC+Files]
to read the compressed WARC files; looks for a declared charset first in each
document's content in the Content-Type {{<meta>}} tag, and then in the HTTP
header; feeds this charset, if any, to the ICU4J charset detector, which
instantiates a {{Reader}} using the detected charset; and then {{read()}}s all
content. The other two classes add the respective CharFilter on top of
{{BaselineWarcTest}}'s functionality.
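
The declared-charset lookup described above can be sketched roughly as follows. This is not the code from the attached {{BaselineWarcTest.java}}; the class name {{DeclaredCharset}}, the method {{find()}}, and both regular expressions are hypothetical and deliberately loose. The ICU4J step is left out so that the sketch needs only the JDK; the resulting hint could then be handed to ICU4J, for example through {{CharsetDetector.getReader(InputStream, String)}}, though whether the attached test does exactly that is an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds a declared charset, checking the document before the HTTP header. */
final class DeclaredCharset {
    // Loose illustration pattern: a charset=... value anywhere inside a <meta ...> tag.
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // A charset=... parameter on a Content-Type header line.
    private static final Pattern HEADER = Pattern.compile(
        "^Content-Type:.*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Returns the charset named in the content's meta tag, else in the HTTP headers, else empty. */
    static Optional<String> find(String httpHeaders, String content) {
        Matcher m = META.matcher(content);
        if (m.find()) {
            return Optional.of(m.group(1));
        }
        m = HEADER.matcher(httpHeaders);
        if (m.find()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n";
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        // The meta tag wins over the header when both declare a charset.
        System.out.println(find(headers, html).orElse("(none)"));
    }
}
```

The result is only a hint: the declared value may be wrong or absent, which is presumably why it is passed to a detector rather than trusted directly.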
The performance numbers (best of 5 trials):
||Language||Baseline||Classic||JFlex||
|English|156s|179s|171s|
|Chinese|155s|180s|172s|
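Excluding charset detection and I/O (measured by {{BaselineWarcTest}}),
{{HTMLStripCharFilter}} adds 23s (English) and 25s (Chinese), while
{{JFlexHTMLStripCharFilter}} adds 15s and 17s. The JFlex implementation thus
appears to improve on {{HTMLStripCharFilter}}'s throughput by about 50% in both
languages.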
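
For anyone who wants to check the arithmetic behind the 50% figure, here is a small sketch that subtracts the baseline from each total in the table and compares the remainders. The class and method names are made up for illustration; only the timings come from the table.

```java
import java.util.Locale;

/** Hypothetical calculator for the filter-only overhead derived from the timing table. */
final class FilterOverhead {
    /** Seconds attributable to the filter alone: total time minus the baseline run. */
    static int overhead(int totalSeconds, int baselineSeconds) {
        return totalSeconds - baselineSeconds;
    }

    /** Classic overhead divided by JFlex overhead; 1.5 means about 50% higher throughput. */
    static double speedup(int baseline, int classic, int jflex) {
        return (double) overhead(classic, baseline) / overhead(jflex, baseline);
    }

    public static void main(String[] args) {
        // English: (179 - 156) / (171 - 156) = 23 / 15
        System.out.println(String.format(Locale.ROOT, "English: %.2fx", speedup(156, 179, 171)));
        // Chinese: (180 - 155) / (172 - 155) = 25 / 17
        System.out.println(String.format(Locale.ROOT, "Chinese: %.2fx", speedup(155, 180, 172)));
    }
}
```

This prints 1.53x for English and 1.47x for Chinese, i.e. roughly 50% in both cases.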
I found a few problems with {{HTMLStripCharFilter}}:
# The following exception was thrown for six of the English documents:
{noformat}
java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
at HTMLStripCharFilterWarcTest.main(HTMLStripCharFilterWarcTest.java:86)
{noformat}
# {{&apos;}} is not decoded.
# Content between some {{<script>}} tags is not stripped out.
# Unbalanced quotation marks in opening tags prevent the tag from being
stripped out.
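
As background on the first problem: {{BufferedReader.reset()}} in general fails with "Mark invalid" when more characters have been read since {{mark()}} than the read-ahead limit allows and the buffer has been refilled. The sketch below reproduces that general mechanism with a deliberately tiny buffer. It is not a diagnosis of the {{HTMLStripCharFilter}} failure; the class name, method name, and input string are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical demo of how reading past a mark's read-ahead limit can invalidate the mark. */
final class MarkInvalidDemo {
    /**
     * Marks the start of the stream, reads charsToRead characters, then resets.
     * Returns the IOException message if reset() fails, or null if it succeeds.
     */
    static String resetAfterReading(int bufferSize, int readAheadLimit, int charsToRead)
            throws IOException {
        BufferedReader reader =
            new BufferedReader(new StringReader("<p>some markup</p>"), bufferSize);
        reader.mark(readAheadLimit);
        for (int i = 0; i < charsToRead; i++) {
            reader.read();
        }
        try {
            reader.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Staying within the limit: reset() succeeds.
        System.out.println(resetAfterReading(4, 2, 1));
        // Reading well past both the limit and the buffer size: reset() fails.
        System.out.println(resetAfterReading(4, 2, 10));
    }
}
```

On OpenJDK the second call reports "Mark invalid", the same message as in the stack trace above. The {{mark()}} javadoc only says that such a reset "may fail", so other implementations could behave differently.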
Left to do:
# Rename {{HTMLStripCharFilter}} to {{ClassicHTMLStripCharFilter}}; move it to
Solr {{o.a.s.analysis}} package; deprecate it; and create a new Solr Factory
for it.
# Rename {{JFlexHTMLStripCharFilter}} to {{HTMLStripCharFilter}}.
# Commit to trunk.
# Backport and commit to branch_3x.
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
> Key: LUCENE-3690
> URL: https://issues.apache.org/jira/browse/LUCENE-3690
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 3.5, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Fix For: 3.6, 4.0
>
> Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java,
> JFlexHTMLStripCharFilterWarcTest.java, LUCENE-3690.patch, LUCENE-3690.patch,
> LUCENE-3690.patch, LUCENE-3690.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and
> easier to understand and maintain.