Make the HTML to text converter keep the punctuation and some basic layout
information.
---------------------------------------------------------------------------------------
Key: NXP-5850
URL: https://jira.nuxeo.org/browse/NXP-5850
Project: Nuxeo Enterprise Platform
Issue Type: Bug
Components: Transforms / Preview
Affects Versions: 5.3.2
Reporter: Olivier Grisel
Assignee: Olivier Grisel
Priority: Major
Fix For: 5.4
Most *_to_text transformer / converter implementations (office files and PDF)
extract a text representation that preserves the punctuation and overall layout
of the document.
However, this is not the case for the HTML extractor. This is a problem for
semantic analysis of Note documents, since most analysers require access to the
punctuation to detect sentence boundaries. Furthermore, the current
implementation extracts the following markup:
"""
<h1>This is the title</h1><p>This is the paragraph.</p>
"""
as:
"""
This is the titleThis is the paragraph
"""
This makes the fulltext indexer extract the single token "titlethis" instead of
the two tokens "title" and "this".
Ideally this should be extracted as:
"""
This is the title
This is the paragraph.
"""
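The expected behaviour can be illustrated with a minimal sketch. This is not the
Nuxeo converter code: it is written in Python with the standard html.parser
module purely for illustration, and the names BLOCK_TAGS,
LayoutPreservingExtractor and html_to_text are hypothetical. The idea is to copy
text nodes verbatim (so the punctuation survives) and to emit a line break
whenever a block-level element closes.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of block-level tags whose closing
# tag should end the current line in the extracted text.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class LayoutPreservingExtractor(HTMLParser):
    """Collect text nodes unchanged and add a newline at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so the line break is emitted on the start tag.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text is kept as-is, punctuation included.
        self.parts.append(data)


def html_to_text(html):
    parser = LayoutPreservingExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace runs inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>This is the title</h1><p>This is the paragraph.</p>"))
```

With the markup from this ticket, the sketch yields the two lines shown above,
so a tokenizer sees "title" and "this" as separate tokens. It does not skip
script or style content and does not attempt to render lists or tables; a real
converter would need to handle those cases.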