[ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753913#action_12753913 ]
Jason Rutherglen commented on SOLR-1394:
----------------------------------------
Here's the exception:
{quote}
Caused by: java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
at org.apache.lucene.analysis.CharReader.read(CharReader.java:51)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:198)
at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:108)
at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:245)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:162)
{quote}
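For context, here is a minimal sketch of how java.io.BufferedReader can raise this exception on its own: reset() may fail with "Mark invalid" once more characters have been read than the read-ahead limit passed to mark(). The class name MarkInvalidSketch, the 4-char buffer, and the input string are made up for illustration. Whether this is the exact trigger inside HTMLStripReader.restoreState is an assumption, not something the trace confirms.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; not part of Solr or of the attached patch.
class MarkInvalidSketch {
    /**
     * Marks the stream with the given read-ahead limit, reads charsToRead
     * characters, then calls reset(). Returns "ok" on success, otherwise
     * the message of the IOException that was thrown.
     */
    static String markReadReset(int readAheadLimit, int charsToRead) {
        // A tiny 4-char buffer forces a refill after only a few reads,
        // which is the point where the reader may drop an expired mark.
        try (BufferedReader r =
                 new BufferedReader(new StringReader("abcdefghijklmnop"), 4)) {
            r.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                r.read();
            }
            r.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(markReadReset(2, 2));   // stays within the limit
        System.out.println(markReadReset(2, 10));  // reads far past the limit
    }
}
```

If this is what happens in the stripper, one thing to check is whether the state-saving code can read further ahead than the limit it passed to mark().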
> HTML stripper is splitting tokens
> ---------------------------------
>
> Key: SOLR-1394
> URL: https://issues.apache.org/jira/browse/SOLR-1394
> Project: Solr
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.4
> Reporter: Anders Melchiorsen
> Attachments: SOLR-1394.patch
>
>
> I am having problems with the Solr HTML stripper.
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "Günther".
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move along with this issue.
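
The difference between the two approaches described above can be sketched roughly as follows. This is a toy illustration, not the code from SOLR-1394.patch: the class StripSketch, its methods, and the sample input "G<b>ü</b>nther" are all assumptions, and only simple <...> tags are handled (no entities, comments, or attributes containing '>').

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy stripper; names and behavior are illustrative only.
class StripSketch {
    /** Space-replacement approach: each tag becomes one space. */
    static String stripWithSpaces(String in) {
        return in.replaceAll("<[^>]*>", " ");
    }

    /** Result of the offset-correcting approach. */
    static final class Stripped {
        final String text;
        // Each entry is {offset in stripped text, total chars removed so far}.
        final List<int[]> corrections;

        Stripped(String text, List<int[]> corrections) {
            this.text = text;
            this.corrections = corrections;
        }

        /** Maps an offset in the stripped text back to the original input. */
        int correctOffset(int off) {
            int shift = 0;
            for (int[] c : corrections) {
                if (c[0] <= off) shift = c[1]; else break;
            }
            return off + shift;
        }
    }

    /** Offset-correcting approach: drop tags, remember how much was dropped. */
    static Stripped stripWithOffsets(String in) {
        StringBuilder out = new StringBuilder();
        List<int[]> corr = new ArrayList<>();
        int removed = 0;
        int i = 0;
        while (i < in.length()) {
            int end = in.charAt(i) == '<' ? in.indexOf('>', i) : -1;
            if (end >= 0) {
                removed += end - i + 1;                 // length of the tag
                corr.add(new int[] {out.length(), removed});
                i = end + 1;
            } else {
                out.append(in.charAt(i));               // plain text, or an unterminated '<'
                i++;
            }
        }
        return new Stripped(out.toString(), corr);
    }

    public static void main(String[] args) {
        String in = "G<b>\u00fc</b>nther";              // assumed sample input
        System.out.println(stripWithSpaces(in).split("\\s+").length);
        Stripped s = stripWithOffsets(in);
        System.out.println(s.text.split("\\s+").length);
        System.out.println(s.correctOffset(2));         // original index of 'n'
    }
}
```

With the space-replacement variant, a whitespace tokenizer sees the word as several fragments. The offset-correcting variant keeps the word whole and can still map positions back to the original markup, for example for highlighting.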
--
This message is automatically generated by JIRA.