[ 
https://issues.apache.org/jira/browse/SOLR-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289762#comment-14289762
 ] 

Steve Rowe edited comment on SOLR-6856 at 1/23/15 7:15 PM:
-----------------------------------------------------------

The {{<div>}} capture issue (also a mapping issue, since there's nothing to map) 
stems from a change in Tika 0.6 (Solr 3.1 upgraded Tika from 0.4 to 0.8): a 
{{DefaultHtmlMapper}} was introduced that only creates events for a subset of 
HTML tags.  When 
[{{HtmlMapper.mapSafeElement()}}|http://tika.apache.org/1.7/api/org/apache/tika/parser/html/HtmlMapper.html#mapSafeElement(java.lang.String)]
 returns null for a tag (as it does for any non-mapped tag), the tag's child 
content is still processed, but no event is created for the tag itself.  
{{HtmlParser}} uses {{DefaultHtmlMapper}} if no {{HtmlMapper.class}} mapping is 
supplied with {{ParseContext}}.  Here's the 1.7 
{{DefaultHtmlMapper.SAFE_ELEMENTS}} definition, where the mappings are 
initialized: 
[http://svn.apache.org/viewvc/tika/tags/1.7/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java?view=markup#l33]
 There is no {{<div>}} entry in there.

The attached patch maps the {{HtmlMapper.class}} in {{ParseContext}} to 
[{{IdentityHtmlMapper}}|http://tika.apache.org/1.7/api/org/apache/tika/parser/html/IdentityHtmlMapper.html],
 which creates events for every HTML element.  A new test is added to check 
that elements including {{<div>}} are captured and mapped properly.
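
To illustrate the difference between the two mapper behaviours described above, here is a minimal, self-contained sketch. It does not use Tika; the {{Mapper}} interface, the {{SUBSET}} and {{IDENTITY}} instances, the {{Node}} tree, and the event strings are all hypothetical stand-ins, and the tag list in {{SUBSET}} is illustrative rather than a copy of {{SAFE_ELEMENTS}}. It only models the rule that a null mapping suppresses the element's own events while its children are still visited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HtmlMapperSketch {

    /** Hypothetical stand-in for a mapper: returns the mapped name, or null if unmapped. */
    interface Mapper {
        String mapSafeElement(String name);
    }

    /** Models a mapper that knows only a subset of tags (illustrative list, no "div"). */
    static final Mapper SUBSET = new Mapper() {
        private final Set<String> safe = new HashSet<>(Arrays.asList("p", "h1", "ul", "li"));

        @Override
        public String mapSafeElement(String name) {
            String lower = name.toLowerCase(Locale.ENGLISH);
            return safe.contains(lower) ? lower : null;
        }
    };

    /** Models an identity-style mapper: every element maps to its lower-cased name. */
    static final Mapper IDENTITY = name -> name.toLowerCase(Locale.ENGLISH);

    /** Tiny document tree: an element has a name and children, a text node has only text. */
    static final class Node {
        final String name; // null for a text node
        final String text; // null for an element
        final List<Node> children;

        Node(String name, String text, List<Node> children) {
            this.name = name;
            this.text = text;
            this.children = children;
        }
    }

    static Node element(String name, Node... children) {
        return new Node(name, null, Arrays.asList(children));
    }

    static Node text(String text) {
        return new Node(null, text, new ArrayList<Node>());
    }

    /** Children are always visited; start/end events are emitted only for mapped elements. */
    static void walk(Node node, Mapper mapper, List<String> events) {
        if (node.name == null) {
            events.add("text:" + node.text);
            return;
        }
        String mapped = mapper.mapSafeElement(node.name);
        if (mapped != null) {
            events.add("start:" + mapped);
        }
        for (Node child : node.children) {
            walk(child, mapper, events);
        }
        if (mapped != null) {
            events.add("end:" + mapped);
        }
    }

    static List<String> events(Node root, Mapper mapper) {
        List<String> out = new ArrayList<>();
        walk(root, mapper, out);
        return out;
    }

    public static void main(String[] args) {
        Node doc = element("div", element("p", text("hello")));
        // prints [start:p, text:hello, end:p]
        System.out.println(events(doc, SUBSET));
        // prints [start:div, start:p, text:hello, end:p, end:div]
        System.out.println(events(doc, IDENTITY));
    }
}
```

With the subset mapper the {{<div>}} never shows up as an event, so there is nothing for a capture or field mapping to attach to; with the identity-style mapper it does. In Tika itself the corresponding step is registering a mapper under {{HtmlMapper.class}} in the {{ParseContext}}, presumably something like {{context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE)}} (an assumption, not copied from the attached patch).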


> regression in /update/extract ? ref guide examples of fmap & xpath don't seem 
> to be working 
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6856
>                 URL: https://issues.apache.org/jira/browse/SOLR-6856
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.0
>            Reporter: Hoss Man
>            Priority: Blocker
>         Attachments: SOLR-6856.patch
>
>
> I updated this page to know about the new bin/solr and example/exampledocs 
> structure/contents...
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> however i noticed that several of the examples listed on that page didn't 
> seem to work any more -- notably...
> * examples using "fmap" don't seem to create the fields they say they will
> * examples using "xpath" don't seem to create any docs at all
> Specific examples i had problems with...
> {noformat}
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div&commit=true" -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&commit=true" -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah&commit=true" -F "sample=@example/exampledocs/sample.html"
> curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&commit=true" -F "sample=@example/exampledocs/sample.html"
> {noformat}
> ...none of these example commands produced an error, but they also didn't 
> seem to create the fields/docs they said they would (ie: no "foo_t" field was 
> created)



