[Nutch-dev] Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

ogjunk-nutch Thu, 15 Sep 2005 22:01:04 -0700

Sounds good to me.

Otis


--- Chris Mattmann <[EMAIL PROTECTED]> wrote:

> Hi Otis,
> 
>  Point taken. Since both convey the same information, I think it's okay
> to support both, but by default we could code the initial plugins
> specified in parse-plugins.xml without the "order=" attribute. Fair
> enough?
> 
> Cheers,
>   Chris
> 
> 
> 
> On 9/15/05 3:23 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> wrote:
> 
> > Well, you have to tell users about order="N" somewhere in the docs.
> > Instead of telling them about order="N", tell them that the order in
> > XML matters.  Either case requires education, and the latter one
> > requires less typing and avoids the case described in the proposal.
> > 
> > Otis
> > 
> > --- Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> > 
> >> Hi Otis,
> >> 
> >> 
> >> This issue arose during our discussion for this proposal, and my
> >> feeling was that the XML specification doesn't state that the order
> >> is significant in an XML file.  I therefore read the spec again, and
> >> indeed didn't find anything on that subject...
> >> 
> >> I think it is somewhat reasonable to consider that a parser _might_
> >> return the elements in a different order, though, as I mentioned to
> >> Chris & Jerome, that would be quite unheard of, and, to be honest,
> >> rather irritating.
> >> 
> >> What do you think?
> >> 
> >> 
> >> Regards,
> >> Sebastien.
> >> 
> >> 
> >> 
> >>> Quick comment about order="N" and the paragraph that describes how to
> >>> deal with cases where people mess things up and enter multiple plugins
> >>> for the same content type and the same order:
> >>> 
> >>> - Why is the order attribute even needed?  It looks like a redundant
> >>> piece of information - why not derive order from the order of plugin
> >>> definitions in the XML file?
> >>> 
> >>> For instance:
> >>> Instead of this:
> >>> 
> >>>   <mimeType name="*">
> >>>       <plugin id="parse-text" order="1"/>
> >>>       <plugin id="another-one-default-parser" order="2"/>
> >>>      ....
> >>>   </mimeType>
> >>> 
> >>> We have this:
> >>> 
> >>>   <mimeType name="*">
> >>>       <plugin id="parse-text"/>
> >>>       <plugin id="another-one-default-parser"/>
> >>>      ....
> >>>   </mimeType>
> >>> 
> >>> parse-text first, another-one-default-parser second.  Less typing,
> >>> and we avoid the case of equal ordering altogether.
> >>> 
> >>> Otis
> >>> 
> >>> 
> >>> --- Apache Wiki <[EMAIL PROTECTED]> wrote:
> >>> 
> >>>> Dear Wiki user,
> >>>> 
> >>>> You have subscribed to a wiki page or wiki category on "Nutch Wiki"
> >>>> for change notification.
> >>>> 
> >>>> The following page has been changed by ChrisMattmann:
> >>>> http://wiki.apache.org/nutch/ParserFactoryImprovementProposal
> >>>> 
> >>>> The comment on the change is:
> >>>> Initial Draft of ParserFactoryImprovementProposal
> >>>> 
> >>>> New page:
> >>>> = Parser Factory Improvement Proposal =
> >>>> 
> >>>> 
> >>>> == Summary of Issue ==
> >>>> Currently Nutch provides a plugin mechanism wherein plugins register
> >>>> certain metadata about themselves, including their id, classname, and
> >>>> so forth. In particular, the set of parsing plugins register which
> >>>> contentTypes and file suffixes they can support with a
> >>>> PluginRepository.
> >>>> 
> >>>> One "adopted practice" in current Nutch parsing plugins
> >>>> (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has
> >>>> also been to verify that the content type passed to it during a fetch
> >>>> is indeed one of the contentTypes that it supports (be it
> >>>> application/xml, or application/pdf, etc.). This practice is
> >>>> cumbersome for a few reasons:
> >>>> 
> >>>>  *Any updates to supported content types for a parsing plugin will
> >>>> require a recompilation of the plugin code
> >>>>  *Checking for "hard coded" content types within the parsing
> >>>> plugin is a duplication of information that already exists in the
> >>>> plugin's descriptor file, plugin.xml
> >>>>  *By the time that content gets to a parsing plugin (e.g., the
> >>>> parsing plugin is returned by the ParserFactory, and provided content
> >>>> during a fetch), the ParserFactory should have already ensured that
> >>>> the appropriate plugin is getting called for a particular
> >>>> contentType.
> >>>> 
> >>>> In addition to this problem is the fact that several parsing plugins
> >>>> may all support many of the same content types. For instance, the
> >>>> parse-js plugin may be the only well suited parsing plugin for
> >>>> javascript, but perhaps it may also provide a good enough heuristic
> >>>> parser for plain text as well, and so it may support both types.
> >>>> However, there may be a parsing plugin for text (which there is!),
> >>>> parse-text, whose primary purpose is to parse plain text as well.
> >>>> 
> >>>> == Suggested Remedy ==
> >>>> To deal with ensuring the desired parsing plugin is called for the
> >>>> appropriate content type, and to, in effect, "kill two birds with
> >>>> one stone", we propose that there be a parsing plugin preference
> >>>> list for each content type that Nutch knows how to handle, i.e., each
> >>>> content type available via the mimeType system. Therefore, during a
> >>>> fetch, once the appropriate mimeType has been determined for content,
> >>>> and the ParserFactory is tasked with returning a parsing plugin, the
> >>>> ParserFactory should consult a preference list for that contentType,
> >>>> allowing it to determine which plugin has the highest preference for
> >>>> the contentType. That parsing plugin should be returned via the
> >>>> ParserFactory to the fetcher. If there is any problem using the
> >>>> initial returned parsing plugin for a particular contentType (i.e.,
> >>>> if a ParseException is thrown during the parse, or a null ParseStatus
> >>>> is returned), then the ParserFactory should be called again, this
> >>>> time asking for the "next highest ranked" plugin for that
> >>>> contentType. This process should repeat until the parse is
> >>>> successful.
> >>>> 
> >>>> We propose that the "plugin preference list" should be a separate
> >>>> file that lives in $NUTCH_HOME/conf called "parse-plugins.xml".
> >>>> The format of the file (full DTD to be developed during coding)
> >>>> should be something like: {{{
> >>>> 
> >>>> <parse-plugins>
> >>>>   <default pluginname="parse-text"/>
> >>>>   <fileType name="powerpoint">
> >>>>    <mimeTypes>
> >>>>     <mimeType name="application/pdf" />
> >>>>     <mimeType name="application/x-pdf" />
> >>>>     ...
> >>>>    </mimeTypes>
> >>>> 
> >>>>    <plugins>
> >>>> 
> >>>>       <plugin name="parse-pdf" order="1"/>
> >>>>       <plugin name="parse-pdf-worse" order="2"/>
> >>>>      ...
> >>>>    </plugins>
> >>>>   </fileType>
> >>>>     ...
> >>>> </parse-plugins>
> >>>> 
> >>>> }}}
> >>>> 
> >>>> 
> >>>> One of the main impacts of having a file like parse-plugins.xml is
> >>>> that pathSuffix="" should no longer be part of the plugin.xml
> >>>> descriptor. We propose to move that out of plugin.xml and into the
> >>>> mime-types.xml file.
> >>>> 
> >>>> == Architectural Impact ==
> >>>> 
> >>>> === Components ===
> >>>>  *Fetcher
> >>>>  *PluginSystem
> >>>>  *ParserFactory
> >>>> 
> >>>> === Impact on current releases of Nutch ===
> >>>> 
> >>>> ''Incompatibilities''
> >>>> 
> >>>> Moving the contentType and pathSuffix out of the plugin.xml file
> >>>> would create an updated version of the plugin.xml descriptor
> >>>> schema for each plugin. To lessen the effect on previous and
> >>>> near-term releases of Nutch, this information could be left as an
> >>>> option in the plugin.xml schema, but marked as "deprecated" to
> >>>> let people know that this functionality isn't part of the parse
> >>>> plugin identification process anymore, but is left in the schema
> >>>> so as not to create incompatibilities with the plugin.xml files that
> >>>> people have already written. However, ultimately, in future releases
> >>>> of Nutch, we propose that the contentType and pathSuffix attributes
> >>>> should be removed from the plugin.xml schema.
> >>>> 
> >>>> Other than the plugin.xml file schema change, this capability
> >>>> addition will simply control the order in which parsing plugins get
> >>>> called during fetching activities. It won't directly impact the
> >>>> segments stored, or the webapp, or any of the main components of
> >>>> Nutch.
> >>>> 
> >>>> ''Issues''
> >>>> 
> >>>> The proposed new capabilities should be first tested on local
> >>>> systems, and if successful, uploaded to JIRA, and verified against
> >>>> the latest SVNs.
> >>>> Unit tests should be written to verify appropriate plugin parsing
> >>>> order.
> >>>> Users will need to be notified in the Nutch tutorial and instruction
> >>>> lists about how to set up the parsing plugin preferences prior to
> >>>> performing a fetch.
> >>>> 
> >>>> == Personnel ==
> >>>> 
> >>>>  *Jerome Charron
> >>>>  *Sébastien Le Callonnec
> >>>>  *Chris A. Mattmann
> >>>> 
> >>>> == Timeframe ==
> >>>> 
> >>> 
> >> === message truncated ===
> >> 
> 
> ______________________________________________
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>  
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                        Mailstop:  171-246
> Phone:  818-354-8810
> _______________________________________________________
>  
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>  
>  
> 
> 
> 
> 
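To make the ordering point from the thread above concrete, here is a minimal Java sketch (not Nutch code) that derives plugin order purely from where the <plugin> elements appear in the file. It assumes the simplified mimeType/plugin layout from the example in the thread; the class and method names are hypothetical. It relies on the standard DOM getElementsByTagName call, which returns nodes in document order, so no order= attribute is needed for the result to be deterministic.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch class; nothing here is part of the Nutch code base.
class PluginOrderSketch {

    /** Returns the plugin ids listed under the given mimeType, in document order. */
    static List<String> pluginIds(String xml, String mimeType) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse preference XML", e);
        }
        List<String> ids = new ArrayList<>();
        NodeList mimeTypes = doc.getElementsByTagName("mimeType");
        for (int i = 0; i < mimeTypes.getLength(); i++) {
            Element mt = (Element) mimeTypes.item(i);
            if (!mimeType.equals(mt.getAttribute("name"))) {
                continue;
            }
            // getElementsByTagName walks the subtree in document order,
            // so the list position itself acts as the preference rank.
            NodeList plugins = mt.getElementsByTagName("plugin");
            for (int j = 0; j < plugins.getLength(); j++) {
                ids.add(((Element) plugins.item(j)).getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<parse-plugins>"
                + "<mimeType name=\"*\">"
                + "<plugin id=\"parse-text\"/>"
                + "<plugin id=\"another-one-default-parser\"/>"
                + "</mimeType>"
                + "</parse-plugins>";
        // prints [parse-text, another-one-default-parser]
        System.out.println(pluginIds(xml, "*"));
    }
}
```

With this approach two plugins can never share the same rank, which is the "equal ordering" case the order= attribute would otherwise have to guard against.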

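The fallback behaviour described under "Suggested Remedy" in the quoted proposal could look roughly like the following. This is only an illustration under stated assumptions: SketchParser is a hypothetical stand-in and not the real Nutch parser interface, a null return stands in for a null ParseStatus, and the preference list is assumed to be already loaded into a map keyed by content type. One deliberate departure from the proposal text: the loop stops when the ranked list is exhausted instead of repeating indefinitely.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch class; names and types are assumptions, not Nutch APIs.
class ParserFallbackSketch {

    /** Stand-in for a parsing plugin: returns parsed text, null, or throws. */
    @FunctionalInterface
    interface SketchParser {
        String parse(String content) throws Exception;
    }

    /**
     * Tries each parser registered for the content type, highest preference
     * first, and returns the first non-null result.
     */
    static Optional<String> parseWithFallback(
            Map<String, List<SketchParser>> preferences,
            String contentType,
            String content) {
        List<SketchParser> ranked = preferences.getOrDefault(contentType, List.of());
        for (SketchParser parser : ranked) {
            try {
                String result = parser.parse(content);
                if (result != null) {
                    return Optional.of(result);
                }
                // null result: treat like a failed parse and try the next one
            } catch (Exception e) {
                // parse failure: fall through to the next ranked parser
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SketchParser failing = c -> { throw new Exception("cannot parse"); };
        SketchParser good = c -> "parsed:" + c;
        Map<String, List<SketchParser>> prefs =
                Map.of("application/pdf", List.of(failing, good));
        // prints Optional[parsed:hello]
        System.out.println(parseWithFallback(prefs, "application/pdf", "hello"));
    }
}
```

Keeping the retry loop in the caller rather than in each plugin matches the proposal's point that plugins should not have to re-check content types themselves.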

