[ 
https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-3690:
--------------------------------

    Attachment: HTMLStripCharFilterWarcTest.java
                JFlexHTMLStripCharFilterWarcTest.java
                BaselineWarcTest.java
                LUCENE-3690.patch

bq. some test docs using typical wellformed html markup

I have access to [ClueWeb09|http://lemurproject.org/clueweb09.php].  For 
performance testing I used the first WARC file for the English and Chinese 
languages ({{en0000/00.warc.gz}} and {{zh0000/00.warc.gz}}), each of which when 
uncompressed contains about 1GB of text (including a small amount of non-HTML 
metadata: WARC information and HTTP headers).  The English WARC contains about 
35,000 documents from about 2,100 unique domains.  The Chinese WARC contains 
about 33,000 documents from about 550 unique domains.

I compared {{JFlexHTMLStripCharFilter}}'s output with that of 
{{HTMLStripCharFilter}} for several hundred documents.  In the course of this 
comparison, I found several problems with the JFlex implementation (e.g. no 
{{<STYLE>}} tag handling; no MS conditional tag handling, such as 
{{<!\[if ! IE]>}}; and some problems handling creative attribute values), which the 
attached patch fixes.  I re-ran the text-only and malformed HTML performance 
tests on the final implementation, and the numbers aren't significantly 
different from those prior to these fixes.  The new patch also contains the 
more-evil {{_TestUtils.randomHtmlishString()}}; shifts the {{CharFilter}} 
javadocs from {{BaseCharFilter.addOffCorrectMap()}} to {{package.html}}; 
and adds several more tests to {{JFlexHTMLStripCharFilterTest.java}}.

I have attached the three classes I used to test performance over the ClueWeb09 
subset.  {{BaselineWarcTest.java}} uses 
[the WarcRecord class supplied with the ClueWeb09 collection|http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Working+with+WARC+Files] 
to read the compressed WARC files; looks for a declared charset, first in the 
Content-Type {{<meta>}} tag in each document's content and then in the HTTP 
header; feeds this charset, if any, to the ICU4J charset detector, which 
instantiates a Reader using the detected charset; and then {{read()}}s all 
content.  The other two classes each add the respective CharFilter on top of 
{{BaselineWarcTest}}'s functionality.
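
The declared-charset lookup described above can be sketched roughly as follows.  This is not the code from the attached {{BaselineWarcTest.java}}: the class name, method name, and regular expressions are hypothetical, it assumes the start of the document has already been decoded byte-for-byte (e.g. as ISO-8859-1) into a String, and it leaves out the ICU4J detection step entirely.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a "meta tag first, HTTP header second" charset lookup. */
final class DeclaredCharsetSketch {
    // charset=... inside a <meta ...> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // charset parameter of an HTTP Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns the declared charset name, preferring the document's own
     * meta tag over the HTTP header; empty if neither declares one.
     */
    static Optional<String> declaredCharset(String contentPrefix, String contentTypeHeader) {
        Matcher meta = META_CHARSET.matcher(contentPrefix);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        if (contentTypeHeader != null) {
            Matcher header = HEADER_CHARSET.matcher(contentTypeHeader);
            if (header.find()) {
                return Optional.of(header.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=gb2312\"></head></html>";
        // The meta tag wins over the header when both are present.
        System.out.println(declaredCharset(html, "text/html; charset=UTF-8"));
    }
}
```

The returned name is only a declaration and may be wrong or unsupported, which is presumably why it is passed on to a detector rather than trusted directly.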

The performance numbers (best of 5 trials):

||Language||Baseline||Classic||JFlex||
|English|156s|179s|171s|
|Chinese|155s|180s|172s|

Excluding charset detection and I/O (measured by {{BaselineWarcTest}}), 
{{JFlexHTMLStripCharFilter}} appears to improve on {{HTMLStripCharFilter}}'s 
throughput by about 50% in both languages.
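
For reference, the "about 50%" figure follows from subtracting the baseline time from each filter's total and comparing the remainders.  The small sketch below just redoes that arithmetic with the numbers from the table; the class and method names are made up for illustration.

```java
/** Hypothetical helper that reproduces the filter-only throughput comparison. */
final class FilterOverheadSketch {
    /** Seconds attributable to the filter alone: total time minus baseline time. */
    static int overheadSeconds(int totalSeconds, int baselineSeconds) {
        return totalSeconds - baselineSeconds;
    }

    /**
     * Throughput gain of the JFlex filter over the classic one, in percent.
     * The same input is processed in both runs, so throughput is inversely
     * proportional to the filter-only time.
     */
    static double throughputGainPercent(int baseline, int classic, int jflex) {
        double classicOverhead = overheadSeconds(classic, baseline);
        double jflexOverhead = overheadSeconds(jflex, baseline);
        return 100.0 * (classicOverhead / jflexOverhead - 1.0);
    }

    public static void main(String[] args) {
        // English: 179s - 156s = 23s classic, 171s - 156s = 15s JFlex.
        System.out.printf("English: %.1f%%%n", throughputGainPercent(156, 179, 171));
        // Chinese: 180s - 155s = 25s classic, 172s - 155s = 17s JFlex.
        System.out.printf("Chinese: %.1f%%%n", throughputGainPercent(155, 180, 172));
    }
}
```

That works out to roughly 53% for English (23s vs. 15s) and 47% for Chinese (25s vs. 17s).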

I found a few problems with {{HTMLStripCharFilter}}:

# The following exception was thrown for six of the English documents:
{noformat}
java.io.IOException: Mark invalid
        at java.io.BufferedReader.reset(BufferedReader.java:485)
        at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
        at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
        at HTMLStripCharFilterWarcTest.main(HTMLStripCharFilterWarcTest.java:86)
{noformat}
# {{&apos;}} is not decoded.
# Content between some {{<script>}} tags is not stripped out.
# Unbalanced quotation marks in opening tags cause the tag not to be stripped 
out.
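
As background for the first item: {{java.io.BufferedReader.reset()}} can fail this way when more characters have been read since {{mark()}} than the read-ahead limit allows and the buffer has had to be refilled.  The sketch below is a standalone illustration using only JDK classes; it is not taken from {{HTMLStripCharFilter}}, and whether this is exactly what happens for those six documents is an assumption.  The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical demo of a BufferedReader mark being invalidated. */
final class MarkInvalidSketch {
    /**
     * Marks the stream, reads charsToRead characters, then calls reset().
     * Returns the IOException message if reset() fails, or null if it succeeds.
     */
    static String readPastMark(int readAheadLimit, int charsToRead) {
        char[] data = new char[64];
        Arrays.fill(data, 'x');
        // A deliberately tiny 8-char buffer so that refills happen quickly.
        try (BufferedReader reader = new BufferedReader(new CharArrayReader(data), 8)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                reader.read();
            }
            try {
                reader.reset();
                return null; // the mark was still valid
            } catch (IOException e) {
                return e.getMessage(); // the mark was invalidated by a refill
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Limit of 2 chars, but 20 chars read: reset() is expected to fail.
        System.out.println("limit 2:  " + readPastMark(2, 20));
        // Limit of 32 chars covers the 20 chars read: reset() should succeed.
        System.out.println("limit 32: " + readPastMark(32, 20));
    }
}
```

A reader that may need to back up over an arbitrarily long run of input therefore needs either a sufficiently large read-ahead limit or its own buffering.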

Left to do:

# Rename {{HTMLStripCharFilter}} to {{ClassicHTMLStripCharFilter}}; move it to 
Solr {{o.a.s.analysis}} package; deprecate it; and create a new Solr Factory 
for it.
# Rename {{JFlexHTMLStripCharFilter}} to {{HTMLStripCharFilter}}.
# Commit to trunk.
# Backport and commit to branch_3x.
                
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
>                 Key: LUCENE-3690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3690
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 3.5, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java, 
> JFlexHTMLStripCharFilterWarcTest.java, LUCENE-3690.patch, LUCENE-3690.patch, 
> LUCENE-3690.patch, LUCENE-3690.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and 
> easier to understand and maintain.


        
