Lewis John McGibbney created ANY23-280:
------------------------------------------
Summary: Restructure ContentExtractor to improve extraction flexibility
Key: ANY23-280
URL: https://issues.apache.org/jira/browse/ANY23-280
Project: Apache Any23
Issue Type: Improvement
Components: core, extractors
Affects Versions: 1.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
Fix For: 1.2
As discussed on ANY23-247, the
[ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
is simply not fit for purpose. The problem was discovered there, and its cause
has plagued our builds ever since. Any extractors which implement
[BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
are based on Extractor.ContentExtractor and hence work off an 'unfixed'
raw data stream, as opposed to a more flexible model such as the
[TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
This issue should restructure RDF extractors to enable more flexibility and to
avoid issues we encounter with the strict SAX parsing logic.
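To illustrate the distinction, here is a minimal, self-contained sketch using only the JDK XML parser. It is not Any23 code: the class StrictVsFixedSketch and the methods parsesStrictly and fixBareAmpersands are hypothetical names, and the ampersand-escaping step is only a stand-in for whatever 'fixing' a restructured extractor might apply before parsing. It shows how a strict parser rejects a slightly malformed raw stream outright, while the same content parses once it has been repaired.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class StrictVsFixedSketch {

    /**
     * Hypothetical stand-in for an extractor that is handed the raw,
     * 'unfixed' content: returns true only if strict parsing succeeds.
     */
    static boolean parsesStrictly(String content) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                content.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // Any well-formedness error aborts the whole extraction.
            return false;
        }
    }

    /**
     * Hypothetical 'fixing' step (an assumption, not Any23 behaviour):
     * escape ampersands that do not start an entity or character reference.
     */
    static String fixBareAmpersands(String content) {
        return content.replaceAll(
            "&(?![A-Za-z]+;|#[0-9]+;|#x[0-9A-Fa-f]+;)", "&amp;");
    }

    public static void main(String[] args) {
        String raw = "<doc><label>Tom & Jerry</label></doc>";
        System.out.println("raw parses:   " + parsesStrictly(raw));
        System.out.println("fixed parses: "
            + parsesStrictly(fixBareAmpersands(raw)));
    }
}
```

The point of the sketch is only that a repair step between the raw stream and the parser turns a hard failure into usable input; which repairs are appropriate for the RDF extractors is left open by this issue.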
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)