[ https://jira.nuxeo.org/browse/NXP-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Grisel resolved NXP-5850.
---------------------------------
Resolution: Fixed
http://hg.nuxeo.org/nuxeo/nuxeo-core/rev/597f70a806ef
> Make the HTML to text converter keep the punctuation and some basic layout
> information.
> ---------------------------------------------------------------------------------------
>
> Key: NXP-5850
> URL: https://jira.nuxeo.org/browse/NXP-5850
> Project: Nuxeo Enterprise Platform
> Issue Type: Bug
> Components: Transforms / Preview
> Affects Versions: 5.3.2
> Reporter: Olivier Grisel
> Assignee: Olivier Grisel
> Priority: Major
> Fix For: 5.4
>
>
> Most *_to_text transformer / converter (office files and PDF) implementations
> extract a text representation that preserves the punctuation and overall
> layout of the document.
> However, this is not the case for the HTML extractor. This is a problem for
> semantic analysis of Note documents, since most analysers require access to
> the punctuation to detect sentence boundaries. Furthermore, the current
> implementation extracts the following markup:
> """
> <h1>This is the title</h1><p>This is the paragraph.</p>
> """
> as:
> """
> This is the titleThis is the paragraph
> """
> This makes the fulltext indexer extract the token "titlethis" instead of
> "title" and "this".
> Ideally this should be extracted as:
> """
> This is the title
> This is the paragraph.
> """
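
The behaviour requested above can be sketched as follows. This is only an illustration in Python using the standard library's html.parser, not the code from the Nuxeo changeset; the names BLOCK_TAGS, LayoutPreservingExtractor and html_to_text are hypothetical, and the list of block-level tags is an assumption. The idea is to keep text nodes untouched (so punctuation survives) and to emit a line break whenever a block-level element closes.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags whose closing should end a line of text.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "li", "tr"}


class LayoutPreservingExtractor(HTMLParser):
    """Collect text nodes verbatim and add a newline at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so the line break is emitted on open.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text is kept as-is, including punctuation.
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment, one block element per line."""
    parser = LayoutPreservingExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Trim each line and drop the empty ones left by nested or adjacent blocks.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>This is the title</h1><p>This is the paragraph.</p>"))
```

With the markup from the description, the heading and the paragraph end up on separate lines, so a whitespace tokenizer sees "title" and "This" as distinct tokens and the final period is kept.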