[ https://issues.apache.org/jira/browse/TIKA-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bear R Giles updated TIKA-4353:
-------------------------------
Description:
The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements
and attributes. While it's easy for a user to copy and paste this file to create
a custom parser, it takes some effort to understand how to make the required
changes. It's also a one-off effort: the work can't be reused elsewhere.
Given that there's already a dependency on jsoup, a far better solution is to
create a parser that accepts a Safelist instead of using a hardcoded list. This
Safelist can be validated and used elsewhere and, perhaps more importantly, it
makes the transition from a jsoup-based solution to a Tika-based solution much
more transparent.
NOTE: a Safelist is a POJO and is NOT limited to the jsoup parser.
h2. Preliminary implementation
I have a preliminary implementation, but it's not yet ready for a POC pull
request.
h3. HtmlParserWithSafelist
This parser is a heavily stripped-down copy of DefaultHtmlParser. All existing
static elements have been removed and replaced with the appropriate calls to
Safelist methods.
This parser also includes a few proposed improvements:
* it captures 'unsafe' elements and attributes. This allows developers to
fine-tune their own Safelist implementations
* it adds optional support for the 'data-*' wildcard. Custom data attributes
('data-*') are an HTML5 feature intended to replace ad hoc custom attributes
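The two proposed improvements above could look roughly like the following sketch. It is plain Java with no Tika or jsoup dependency, and every name in it (RecordingFilterSketch, acceptTag, acceptAttribute, the "tag@attribute" key format) is hypothetical. The preliminary implementation presumably delegates the allow/deny decision to the Safelist; the simple sets used here only stand in for that.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the filtering step; not Tika or jsoup code. */
final class RecordingFilterSketch {
    private static final String DATA_PREFIX = "data-";

    private final Set<String> allowedTags;                    // lower-case tag names
    private final Map<String, Set<String>> allowedAttributes; // tag -> attribute names
    private final boolean allowDataAttributes;                // optional data-* support

    // Everything that was filtered out, in first-seen order.
    private final Set<String> rejectedTags = new LinkedHashSet<>();
    private final Set<String> rejectedAttributes = new LinkedHashSet<>();

    RecordingFilterSketch(Set<String> allowedTags,
                          Map<String, Set<String>> allowedAttributes,
                          boolean allowDataAttributes) {
        this.allowedTags = allowedTags;
        this.allowedAttributes = allowedAttributes;
        this.allowDataAttributes = allowDataAttributes;
    }

    /** Returns true if the tag is allowed; otherwise records it as rejected. */
    boolean acceptTag(String tag) {
        String name = tag.toLowerCase(Locale.ROOT);
        if (allowedTags.contains(name)) {
            return true;
        }
        rejectedTags.add(name);
        return false;
    }

    /** Returns true if the attribute is allowed on the tag; otherwise records it. */
    boolean acceptAttribute(String tag, String attribute) {
        String t = tag.toLowerCase(Locale.ROOT);
        String a = attribute.toLowerCase(Locale.ROOT);
        // Optional wildcard: any "data-x" attribute, but not a bare "data-".
        if (allowDataAttributes
                && a.startsWith(DATA_PREFIX)
                && a.length() > DATA_PREFIX.length()) {
            return true;
        }
        Set<String> allowed = allowedAttributes.getOrDefault(t, Collections.emptySet());
        if (allowed.contains(a)) {
            return true;
        }
        rejectedAttributes.add(t + "@" + a);
        return false;
    }

    Set<String> rejectedTags() {
        return Collections.unmodifiableSet(rejectedTags);
    }

    Set<String> rejectedAttributes() {
        return Collections.unmodifiableSet(rejectedAttributes);
    }
}
```

After a parse, a developer could inspect rejectedTags() and rejectedAttributes() to decide which entries, if any, belong in their own Safelist.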
h3. DefaultHtmlSafelist
jsoup's Safelist already provides a few reference implementations, but they
don't fit our needs. This class adds two more, DEFAULT and HTML5. In addition,
it adds support for wildcard attributes beyond the "data-*" wildcard mentioned
earlier.
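As an illustration of what general wildcard attribute support might involve, here is a minimal sketch of prefix-style matching. It is an assumption about the approach, not the code from the preliminary implementation: the class name WildcardAttributeSketch, the trailing-"*" pattern syntax, and the "aria-*" example are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical matcher for exact attribute names and trailing-"*" wildcards. */
final class WildcardAttributeSketch {
    private final Set<String> exactNames = new HashSet<>();
    private final List<String> prefixes = new ArrayList<>();

    /** Registers either an exact name ("title") or a prefix pattern ("aria-*"). */
    WildcardAttributeSketch add(String pattern) {
        String p = pattern.toLowerCase(Locale.ROOT);
        if (p.endsWith("*")) {
            // Store the pattern without the trailing '*'.
            prefixes.add(p.substring(0, p.length() - 1));
        } else {
            exactNames.add(p);
        }
        return this;
    }

    /** Case-insensitive match against the registered names and prefixes. */
    boolean matches(String attribute) {
        String a = attribute.toLowerCase(Locale.ROOT);
        if (exactNames.contains(a)) {
            return true;
        }
        for (String prefix : prefixes) {
            // Require at least one character after the prefix,
            // so "aria-" alone does not match "aria-*".
            if (a.length() > prefix.length() && a.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

One consequence of this design is that a bare "*" pattern matches every non-empty attribute name, so a real implementation would probably want to reject or special-case it.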
*DEFAULT*
This implementation reproduces the existing behavior with a few improvements:
* <source> (since it contains an external reference)
* <form> (since "action" can be an embedded script)
* <button> and <input> (since they have a "formaction" attribute)
* all global attributes
* all form_control, mouse, keyboard, and clipboard events
* <body> and all window events
* <head> (just for completeness with <body>)
IIRC the existing elements have gained a few new attributes with HTML5, but I
haven't addressed that.
*HTML5*
This implementation adds many new HTML5 tags, with an emphasis on tags that
provide semantic context, e.g., <section>, <article>, and <time>.
was:
The current DefaultHtmlParser uses a hardcoded list of acceptable HTML elements
and attributes. While it's easy for the user to copy-and-paste this file for a
custom parser it requires some effort to understand how to make the required
changes. It's also a one-off effort - this work can't be reused elsewhere.
Given that there's already a dependency on JSoup... a far better solution is to
create a parser that accepts a Safelist instead of using a hardcoded list. This
Safelist can be validated and used elsewhere, and perhaps more importantly it
makes the transition from a jsoup-based solution to a tika-based solution much
more transparent.
NOTE: a Safelist is a POJO and NOT limited to just the jsoup parser.
h2. Preliminary implementation
I have a preliminary implementation that's not ready for a POC pull request -
yet.
HtmlParserWithSafelist
This parser is a very stripped down copy of the DefaultHtmlParser. It has
removed all existing static elements and replaced them with the appropriate
calls to Safelist methods.
This parser also includes a few proposed improvements:
* it captures 'unsafe' elements and attributes. This allows developers to
finetune their own Safelist implementations
* it adds optional support for the 'data-*' wildcard. This is a HTML5(?)
standard intended to eliminate custom attributes
h3. DefaultHtmlSafelist
The jsoup Safelist already provides a few reference implementations but they
don't fit our needs. This class adds two. In addition it adds support for
wildcard attributes beyond the "data-*" mentioned earlier.
DEFAULT
This implementation reproduces the existing behavior with a few improvements
* <source> (since it contains an external reference)
* <form> (since "action" can be an embedded script)
* <button> and <input> since they have a "formaction" attribute
* all global attributes
* all form_control, mouse, keyboard, and clipboard events
* <body> and all window events
* <head> (just for completeness with <body>)
IIRC the existing elements have added a few new attributes with HTML5 but I
haven't addressed that.
HTML5
This implementation adds many new HTML5 tags, with an emphasis on the tags
that provide semantic context. E.g., <section>, <article>, <time>, etc.
> Implement HtmlParserWithSafelist that uses a standard jsoup Safelist for
> filtering.
> -----------------------------------------------------------------------------------
>
> Key: TIKA-4353
> URL: https://issues.apache.org/jira/browse/TIKA-4353
> Project: Tika
> Issue Type: Improvement
> Reporter: Bear R Giles
> Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.10#820010)