Re: Improved handling of attributes

Mattmann, Chris A (388J) Wed, 26 May 2010 06:49:39 -0700

Hey Ken,

I wanted to get back to you on this:


> 
> 1. Ability to allow all attributes through from HTML documents
> 
> TIKA-379, building on TIKA-347, allows both more relaxed passing of
> attributes, as well as letting all elements through.
> 
> So if somebody wants to get the "lang" attribute for the <html>
> element of an HTML document, they could do this by using the identity
> mapper.
> 
> Assuming this gets reviewed/committed by Chris, then this issue would
> be solved.

I've been busy working on the book, but have had some time to review this. I
think using the existing hooks, like the identity mapper in this case, makes
a lot of sense, and I don't see any big barriers to getting this into the
sources. Plus, since it's configurable and overridable, we get the best of
both worlds.

> 
> 2. Automatically let all valid XHTML 1.0 attributes through from HTML
> documents
> 
> This would be an improvement, as many consumers of parse output
> wouldn't want to process the raw (unnormalized) elements they'd get
> with the IdentityHtmlMapper, but they would want to get any standard
> attributes.
> 
> I believe this would require changing the DefaultHtmlMapper to "know"
> about valid attributes for different elements. I've filed TIKA-430 for
> this, please take a look and comment.

How is this different from TIKA-379 and TIKA-347? Is it defining a valid set
of attributes that determines what gets through and what doesn't? If so, I
would see this as an extension of TIKA-379 and TIKA-347, or just a special
case of them.
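
To make the question concrete, here is a minimal sketch of what a per-element
set of valid attributes might look like. Everything in it is an assumption
for illustration: the class name, the method name and the tiny attribute
table are hypothetical, and this is not the code in DefaultHtmlMapper or in
TIKA-430.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; not Tika's DefaultHtmlMapper.
final class AttributeWhitelistSketch {

    // Illustrative subset, NOT the full XHTML 1.0 attribute list.
    private static final Map<String, Set<String>> SAFE_ATTRIBUTES = Map.of(
            "html", Set.of("lang"),
            "a", Set.of("href", "name", "rel", "title"),
            "img", Set.of("src", "alt", "width", "height"));

    // Returns the normalized (lower-case) attribute name if the attribute
    // is listed for the element, or null if it should be dropped.
    static String mapSafeAttribute(String elementName, String attributeName) {
        String element = elementName.toLowerCase(Locale.ENGLISH);
        String attribute = attributeName.toLowerCase(Locale.ENGLISH);
        Set<String> allowed = SAFE_ATTRIBUTES.get(element);
        return (allowed != null && allowed.contains(attribute)) ? attribute : null;
    }
}
```

Seen this way, the identity mapper is the case where every attribute is
allowed, and a table like the one above is simply a stricter policy plugged
into the same hook.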

> 
> 3. Make it easier for parsers to correctly add valid attributes to
> XHTML elements.
> 
> For example, the PDF parser might have language data that it would
> like to use to label individual paragraphs.
> 
> I think this would require a utility that takes a normalized element
> name, and a generic attribute name (e.g. something from Dublin Core),
> and returns back what the element attribute name should be (or null,
> if not appropriate).
> 
> I'd like to get some feedback on this approach, before filing an issue.

I'm worried that we're mixing concerns here. Some of the information that
you cite above sounds more like metadata to me (and in fact, thinking about
it, you could argue that the attributes on the XHTML that defines the
textual structure are themselves more like metadata attributes too). Where
do you see the delineation?
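
For reference while discussing this, the utility described in the quoted
text might look roughly like the sketch below. The names are hypothetical,
the generic key "language" stands in for whatever Dublin Core style name
would actually be used, and the list of elements that do not take "lang" is
an illustrative subset rather than a checked copy of the XHTML 1.0 DTD.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed lookup utility; not existing Tika code.
final class GenericAttributeSketch {

    // Generic (metadata-style) name -> XHTML attribute name. Assumed mapping.
    private static final Map<String, String> GENERIC_TO_XHTML =
            Map.of("language", "lang");

    // Elements assumed not to accept "lang" (illustrative subset).
    private static final Set<String> NO_LANG =
            Set.of("base", "br", "param", "script");

    // Returns the attribute name to use on the element, or null if the
    // generic name has no mapping or is not appropriate for that element.
    static String toElementAttribute(String elementName, String genericName) {
        String attribute = GENERIC_TO_XHTML.get(genericName);
        if (attribute == null) {
            return null;
        }
        String element = elementName.toLowerCase(Locale.ENGLISH);
        return NO_LANG.contains(element) ? null : attribute;
    }
}
```

Under these assumptions a PDF parser would ask for ("p", "language") and get
back "lang", while the same value could just as easily be recorded as
metadata, which is the delineation I'm asking about.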

> 
> 4. Make it possible for parsers to return non-standard attributes
> 
> Andrzej requested this, and suggested the use of namespaces to avoid
> generating invalid XHTML output.
> 
> But currently we strip out namespaces from the source XHTML, for
> example, as they can make processing the resulting data much harder if
> you're using XPath expressions. I don't know if the same would be true
> for clients of Tika. Any thoughts on this?

Like my comment on #3 above, I'm wondering if this should be dealt with in
metadata? What's the use case for making the XHTML more complex by allowing
these attributes?

> 
> 5. Validation of attribute values
> 
> Not sure if this is important, but if we want the XHTML output to be
> valid, then what you can put in an attribute value has some general
> restrictions (e.g. must be quoted) and some specific restrictions
> based on the actual attribute.
> 
> So an open question is whether the mapSafeAttributes() method should
> also take the attribute value, and do simple fixup (quoting) or
> rejection of invalid values. This would mean passing in the attribute
> value, and returning an "attribute record" (or null) in the response,
> to be able to pass back normalized name & value. Again, any thoughts
> on this?

In general, I think that validation makes a lot of sense. My only question
is where to handle it. The more I read through your thoughts (thanks for
them!), the more I think it might make sense to plumb some of this through
to metadata. WDYT?
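
As a talking point, here is a minimal sketch of the "attribute record"
variant described in the quoted text: the mapper receives the value as well,
and returns a normalized name and value, or null to reject. The class names,
the method signature and the width/height rule are assumptions made up for
illustration, and the input value is assumed to be unescaped.

```java
import java.util.Locale;

// Hypothetical sketch; not the signature of any existing Tika method.
final class AttributeValidationSketch {

    // Simple name/value holder standing in for the "attribute record".
    static final class SafeAttribute {
        final String name;
        final String value;

        SafeAttribute(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Returns a normalized attribute, or null if the value is rejected.
    static SafeAttribute mapSafeAttribute(String attributeName, String value) {
        if (attributeName == null || value == null) {
            return null;
        }
        String name = attributeName.toLowerCase(Locale.ENGLISH);
        String trimmed = value.trim();
        // Example of an attribute-specific restriction: a length must be
        // a number of pixels or a percentage.
        boolean isLength = name.equals("width") || name.equals("height");
        if (isLength && !trimmed.matches("[0-9]+%?")) {
            return null;
        }
        return new SafeAttribute(name, escape(trimmed));
    }

    // General fixup: escape characters that cannot appear raw inside a
    // double-quoted attribute value. "&" is replaced first on purpose.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("\"", "&quot;");
    }
}
```

Returning null for rejection keeps the same convention as the name-only
mapping, so callers would not need a second code path for invalid values.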

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
