[
https://issues.apache.org/jira/browse/SOLR-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Rowe updated SOLR-6856:
-----------------------------
Attachment: SOLR-6856.patch
The {{<div>}} capture issue (which is also a mapping issue, since nothing is
captured to map) dates back to Tika 0.6 (Solr 3.1 upgraded Tika from 0.4 to
0.8), which introduced a {{DefaultHtmlMapper}} that only creates events for a
subset of HTML tags. When
http://tika.apache.org/1.7/api/org/apache/tika/parser/html/HtmlMapper.html#mapSafeElement(java.lang.String)
returns null (as it does for any non-mapped tag), the tag's child content is
processed, but no event is created for the tag itself. {{HtmlParser}} uses
{{DefaultHtmlMapper}} if no {{HtmlMapper.class}} mapping is supplied with
{{ParseContext}}. Here's the 1.7 {{DefaultHtmlMapper.SAFE_ELEMENTS}}
definition, where the mappings are initialized:
[http://svn.apache.org/viewvc/tika/tags/1.7/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java?view=markup#l33]
There is no {{<div>}} in there.
The attached patch maps the {{HtmlMapper.class}} in {{ParseContext}} to
[{{IdentityHtmlMapper}}|http://tika.apache.org/1.7/api/org/apache/tika/parser/html/IdentityHtmlMapper.html],
which creates events for every HTML element. A new test is added to check
that elements including {{<div>}} are captured and mapped properly.
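To illustrate why {{<div>}} never reaches the capture/fmap logic, here is a
minimal, self-contained sketch of the mapper behavior described above. It does
not use Tika: the {{Mapper}} interface, the {{SAFE}} set, and the
{{events}} helper are hypothetical stand-ins, and the tag subset is an
assumption for illustration only (the real {{SAFE_ELEMENTS}} list is longer).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MapperSketch {
    /** Simplified stand-in for the mapSafeElement contract: mapped name, or null to skip the tag. */
    interface Mapper {
        String mapSafeElement(String name);
    }

    // Hypothetical subset of "safe" tags; note that "div" is deliberately absent.
    static final Set<String> SAFE =
            new HashSet<>(Arrays.asList("p", "h1", "ul", "li", "table"));

    // Behaves like a default mapper: unknown tags map to null.
    static final Mapper DEFAULT_LIKE = name -> {
        String lower = name.toLowerCase(Locale.ENGLISH);
        return SAFE.contains(lower) ? lower : null;
    };

    // Behaves like an identity mapper: every tag is passed through.
    static final Mapper IDENTITY_LIKE = name -> name.toLowerCase(Locale.ENGLISH);

    /** Returns the element events that would be emitted for the given tags. */
    static List<String> events(Mapper mapper, String... tags) {
        List<String> emitted = new ArrayList<>();
        for (String tag : tags) {
            String mapped = mapper.mapSafeElement(tag);
            if (mapped != null) {
                emitted.add(mapped); // an event is created only for mapped tags
            }
            // for a null mapping, children would still be processed,
            // but the tag itself produces no event
        }
        return emitted;
    }
}
```

With the default-like mapper, {{events(DEFAULT_LIKE, "div", "p")}} contains
only {{p}}, so there is no {{div}} event to capture or map; with the
identity-like mapper both tags come through. In Tika itself the registration
would presumably look something like
{{context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE)}}; see the
attached patch for the actual change.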
> regression in /update/extract ? ref guide examples of fmap & xpath don't seem
> to be working
> --------------------------------------------------------------------------------------------
>
> Key: SOLR-6856
> URL: https://issues.apache.org/jira/browse/SOLR-6856
> Project: Solr
> Issue Type: Bug
> Affects Versions: 5.0
> Reporter: Hoss Man
> Priority: Blocker
> Attachments: SOLR-6856.patch
>
>
> I updated this page to know about the new bin/solr and example/exampledocs
> structure/contents...
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> however i noticed that several of the examples listed on that page didn't
> seem to work any more -- notably...
> * examples using "fmap" don't seem to create the fields they say they will
> * examples using "xpath" don't seem to create any docs at all
> Specific examples i had problems with...
> {noformat}
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&commit=true" \
>   -F "sample=@example/exampledocs/sample.html"
> {noformat}
> ...none of these example commands produced an error, but they also didn't
> seem to create the fields/docs they said they would (ie: no "foo_t" field was
> created)