[
https://issues.apache.org/jira/browse/ANY23-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340890#comment-15340890
]
ASF GitHub Bot commented on ANY23-280:
--------------------------------------
Github user lewismc commented on the issue:
https://github.com/apache/any23/pull/24
> There are a large number of whitespace modifications to change from tabs
> to 2-space indentation. Is 2-space indentation what Any23 is aiming for, given
> that most Java code uses either tab or 4-space indentation?
I'll revert these changes to 4 spaces, as per the remainder of the codebase,
and force an update to this PR.
> If we are going to be modifying the public API we probably should be
> aiming for a 2.0 release, otherwise the version numbers are arbitrary.
I would have no issues with this at all... it is a very good suggestion.
> Given how broad this pull request is, it needs to be completed before I
> can work on some of the issues I have assigned to me.
Agreed. I'll put some time into it this week and see if I can complete it,
stabilize tests and update the PR for review.
> Refactor ContentExtractor to improve extraction flexibility
> -----------------------------------------------------------
>
> Key: ANY23-280
> URL: https://issues.apache.org/jira/browse/ANY23-280
> Project: Apache Any23
> Issue Type: Improvement
> Components: core, extractors
> Affects Versions: 1.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.2
>
>
> As discussed on ANY23-247, the
> [ContentExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44]
> is simply not fit for purpose. This issue was discovered and the cause has
> plagued our builds ever since. Any extractors which implement
> [BaseRDFExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java]
> are based on the Extractor.ContentExtractor and hence work off of an
> 'unfixed' raw data stream as opposed to a more flexible model such as the
> [TagSoupDOMExtractor|https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60].
> This issue should refactor RDF extractors to enable more flexibility and to
> avoid issues we encounter with the strict SAX parsing logic.