Hi Chris,

On May 26, 2010, at 6:49am, Mattmann, Chris A (388J) wrote:

Hey Ken,

I wanted to get back to you on this:


1. Ability to allow all attributes through from HTML documents

TIKA-379, building on TIKA-347, allows both more relaxed passing of
attributes and letting all elements through.

So if somebody wants to get the "lang" attribute for the <html>
element of an HTML document, they could do this by using the identity
mapper.

Assuming this gets reviewed/committed by Chris, then this issue would
be solved.

Been busy working on the book, but have had some time to review this. I think using the hooks already in place, like the identity mapper in this case, makes a lot of sense, and I don't see any big barriers to getting this into the sources. Plus, since it's configurable and overridable, we can get the best of both worlds.
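
For concreteness, here's a rough sketch of how a consumer might opt into the identity mapping once the patch lands. The IdentityHtmlMapper class name and the ParseContext-based override are my reading of Julien's patch, so treat the details as assumptions until it's actually committed:

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LangAttributeExample {
    public static void main(String[] args) throws Exception {
        String html = "<html lang=\"fr\"><body><p>Bonjour</p></body></html>";

        ParseContext context = new ParseContext();
        // Swap in the identity mapper so elements/attributes pass through
        // without the default normalization.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Minimal SAX handler that reports the lang attribute when it sees
        // an <html> element in the parser's output.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String name, Attributes atts) {
                String lang = atts.getValue("lang");
                if ("html".equalsIgnoreCase(local) && lang != null) {
                    System.out.println("lang = " + lang);
                }
            }
        };

        InputStream in = new ByteArrayInputStream(html.getBytes("UTF-8"));
        try {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        } finally {
            in.close();
        }
    }
}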


2. Automatically let all valid XHTML 1.0 attributes through from HTML
documents

This would be an improvement, as many consumers of parse output
wouldn't want to process the raw (unnormalized) elements they'd get
with the IdentityHtmlMapper, but they would want to get any standard
attributes.

I believe this would require changing the DefaultHtmlMapper to "know"
about valid attributes for different elements. I've filed TIKA-430 for
this, please take a look and comment.

How is this different from TIKA-379 and TIKA-347? Is it defining a valid set of attributes that are used to determine what gets through and what doesn't? If so, I would see this as an extension to TIKA-379 and TIKA-347, or just a
special case of it.

I could piggy-back on TIKA-379, but I'd rather that get committed first, since it's Julien's patch. Afterwards it should be pretty easy for me to add tests & support for some number of attributes - not sure if I've got what it takes to dump in the entire set of valid attributes right away :)
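
To make the TIKA-430 idea concrete, here's a rough sketch of the kind of mapper I have in mind. It assumes HtmlMapper keeps the mapSafeElement / isDiscardElement / mapSafeAttribute shape from TIKA-347/379, and the attribute table is only a tiny illustrative slice, not the full XHTML 1.0 set:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.tika.parser.html.DefaultHtmlMapper;
import org.apache.tika.parser.html.HtmlMapper;

public class XhtmlAttributeMapper implements HtmlMapper {

    // Per-element whitelist of valid XHTML 1.0 attributes (illustrative only).
    private static final Map<String, Set<String>> SAFE_ATTRIBUTES =
            new HashMap<String, Set<String>>();
    static {
        SAFE_ATTRIBUTES.put("a", new HashSet<String>(Arrays.asList("href", "title", "hreflang")));
        SAFE_ATTRIBUTES.put("img", new HashSet<String>(Arrays.asList("src", "alt", "width", "height")));
        SAFE_ATTRIBUTES.put("html", new HashSet<String>(Arrays.asList("lang")));
    }

    // Reuse the default element mapping; only the attribute handling changes.
    private final HtmlMapper delegate = DefaultHtmlMapper.INSTANCE;

    public String mapSafeElement(String name) {
        return delegate.mapSafeElement(name);
    }

    public boolean isDiscardElement(String name) {
        return delegate.isDiscardElement(name);
    }

    public String mapSafeAttribute(String elementName, String attributeName) {
        Set<String> safe = SAFE_ATTRIBUTES.get(elementName.toLowerCase());
        return (safe != null && safe.contains(attributeName.toLowerCase())) ? attributeName : null;
    }
}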

3. Make it easier for parsers to correctly add valid attributes to
XHTML elements.

For example, the PDF parser might have language data that it would
like to use to label individual paragraphs.

I think this would require a utility that takes a normalized element
name, and a generic attribute name (e.g. something from Dublin Core),
and returns back what the element attribute name should be (or null,
if not appropriate).

I'd like to get some feedback on this approach, before filing an issue.

I'm worried that we're mixing concerns here. Some of the information that you cite above sounds more to me like metadata (and in fact, thinking about it, you could argue that the attributes on the XHTML that defines the textual structure are more like metadata attributes too). Where do you see the delineation?

Same as Jukka - if you have per-element attributes, then you can't (cleanly) represent these in metadata.

Metadata is at the document level, attributes are at the element/tag level.
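
To make #3 less abstract, here's the kind of utility I'm picturing. The class and method names are purely hypothetical (nothing like this exists in Tika yet), and the mappings shown are just examples:

public final class XhtmlAttributeNames {

    private XhtmlAttributeNames() {
        // static utility, not meant to be instantiated
    }

    /**
     * @param elementName normalized XHTML element name, e.g. "p"
     * @param genericName generic property name, e.g. "language" (Dublin Core)
     * @return the XHTML attribute name to use, or null if none is appropriate
     */
    public static String attributeFor(String elementName, String genericName) {
        if ("language".equals(genericName)) {
            // xml:lang is valid on nearly every XHTML 1.0 element
            return "xml:lang";
        }
        if ("identifier".equals(genericName)) {
            return "id";
        }
        return null; // no sensible per-element attribute for this property
    }
}

So a PDF parser with per-paragraph language data would call attributeFor("p", "language") and emit <p xml:lang="de">, without having to know the XHTML rules itself.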

4. Make it possible for parsers to return non-standard attributes

Andrzej requested this, and suggested the use of namespaces to avoid
generating invalid XHTML output.

But currently we strip out namespaces from the source XHTML, for
example, as they can make processing the resulting data much harder if you're using XPath expressions. I don't know if the same would be true
for clients of Tika. Any thoughts on this?

Like my comment on #3 above, I'm wondering if this should be dealt with in metadata? What's the use case for making the XHTML more complex by allowing
these attributes?

See above.
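
Just so we're all picturing the same thing, here's roughly what Andrzej's suggestion might look like on the parser side. I'm assuming XHTMLContentHandler's startElement(String, AttributesImpl) overload; the namespace URI and attribute name are made up, and note the output would still need the prefix declared (xmlns:ex) to be valid, which is part of what makes me hesitant:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class NamespacedAttributeExample {

    // Made-up namespace for parser-specific (non-XHTML) attributes.
    private static final String EXT_NS = "http://tika.apache.org/ns/example";

    public static void emitParagraph(ContentHandler out, Metadata metadata,
                                     String text, double confidence) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(out, metadata);
        xhtml.startDocument();

        AttributesImpl attrs = new AttributesImpl();
        // Non-standard attribute kept out of the XHTML namespace so the
        // output can still be valid XHTML.
        attrs.addAttribute(EXT_NS, "confidence", "ex:confidence", "CDATA",
                String.valueOf(confidence));
        xhtml.startElement("p", attrs);
        xhtml.characters(text);
        xhtml.endElement("p");

        xhtml.endDocument();
    }
}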

5. Validation of attribute values

Not sure if this is important, but if we want the XHTML output to be
valid, then what you can put in an attribute value has some general
restrictions (e.g. must be quoted) and some specific restrictions
based on the actual attribute.

So an open question is whether the mapSafeAttributes() method should
also take the attribute value, and do simple fixup (quoting) or
rejection of invalid values. This would mean passing in the attribute
value, and returning an "attribute record" (or null) in the response,
to be able to pass back normalized name & value. Again, any thoughts
on this?

In general, I think that validation makes a lot of sense. My only question would be where to handle it (and the more I read through your thoughts -- thanks for them! -- the more I'm thinking it might make sense to plumb some of this through to metadata). WDYT?

Same as above - though I hadn't considered the case of validating metadata values as well.
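
Here's a strawman of the "attribute record" idea from #5, mostly to have something concrete to poke at. The class name, the static method, and the width/height check are all made up, and real escaping would be left to the SAX serializer:

public class MappedAttribute {

    private final String name;
    private final String value;

    public MappedAttribute(String name, String value) {
        this.name = name;
        this.value = value;
    }

    public String getName()  { return name; }
    public String getValue() { return value; }

    /**
     * Hypothetical replacement for mapSafeAttribute(): validate the value,
     * do simple fixup where possible, and return null to reject the
     * attribute outright.
     */
    public static MappedAttribute mapSafeAttribute(String element, String attribute, String value) {
        if (value == null) {
            return null;
        }
        // Example of an attribute-specific restriction: width/height on <img>
        // must be a length (digits, optionally followed by '%') in XHTML.
        if ("img".equals(element)
                && ("width".equals(attribute) || "height".equals(attribute))) {
            return value.matches("\\d+%?") ? new MappedAttribute(attribute, value) : null;
        }
        // Generic fixup: trim whitespace; quoting/escaping is the serializer's job.
        return new MappedAttribute(attribute, value.trim());
    }
}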

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



