Sounds good to me.

Otis
--- Chris Mattmann <[EMAIL PROTECTED]> wrote:

> Hi Otis,
>
> Point taken. In actuality, since both convey the same information, I
> think that it's okay to support both, but by default, say, we could code
> the initial plugins specified in parse-plugins.xml without the "order="
> attribute. Fair enough?
>
> Cheers,
> Chris
>
> On 9/15/05 3:23 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>
> > Well, you have to tell users about order="N" somewhere in the docs.
> > Instead of telling them about order="N", tell them that the order in
> > XML matters. Either case requires education, and the latter one
> > requires less typing and avoids the case described in the proposal.
> >
> > Otis
> >
> > --- Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Otis,
> >>
> >> This issue arose during our discussion for this proposal, and my
> >> feeling was that the XML specification doesn't state that the order
> >> is significant in an XML file. I therefore read the spec again, and
> >> indeed didn't find anything on that subject...
> >>
> >> I think it is somehow reasonable to consider that a parser _might_
> >> return the elements in a different order - though, as I mentioned to
> >> Chris & Jerome, that would be quite unheard of, and, to be honest,
> >> rather irritating.
> >>
> >> What do you think?
> >>
> >> Regards,
> >> Sebastien.
> >>
> >>> Quick comment about order="N" and the paragraph that describes how
> >>> to deal with cases where people mess things up and enter multiple
> >>> plugins for the same content type and the same order:
> >>>
> >>> - Why is the order attribute even needed? It looks like a redundant
> >>> piece of information - why not derive order from the order of plugin
> >>> definitions in the XML file?
> >>>
> >>> For instance:
> >>> Instead of this:
> >>>
> >>> <mimeType name="*">
> >>>   <plugin id="parse-text" order="1"/>
> >>>   <plugin id="another-one-default-parser" order="2"/>
> >>>   ....
> >>> </mimeType>
> >>>
> >>> We have this:
> >>>
> >>> <mimeType name="*">
> >>>   <plugin id="parse-text"/>
> >>>   <plugin id="another-one-default-parser"/>
> >>>   ....
> >>> </mimeType>
> >>>
> >>> parse-text first, another-one-default-parser second. Less typing,
> >>> and we avoid the case of equal ordering altogether.
> >>>
> >>> Otis
> >>>
> >>> --- Apache Wiki <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Dear Wiki user,
> >>>>
> >>>> You have subscribed to a wiki page or wiki category on "Nutch Wiki"
> >>>> for change notification.
> >>>>
> >>>> The following page has been changed by ChrisMattmann:
> >>>> http://wiki.apache.org/nutch/ParserFactoryImprovementProposal
> >>>>
> >>>> The comment on the change is:
> >>>> Initial Draft of ParserFactoryImprovementProposal
> >>>>
> >>>> New page:
> >>>> = Parser Factory Improvement Proposal =
> >>>>
> >>>> == Summary of Issue ==
> >>>> Currently Nutch provides a plugin mechanism wherein plugins register
> >>>> certain metadata about themselves, including their id, classname,
> >>>> and so forth. In particular, the set of parsing plugins registers
> >>>> which contentTypes and file suffixes they can support with a
> >>>> PluginRepository.
> >>>>
> >>>> One "adopted practice" in current Nutch parsing plugins
> >>>> (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has
> >>>> also been to verify that the content type passed to it during a
> >>>> fetch is indeed one of the contentTypes that it supports (be it
> >>>> application/xml, or application/pdf, etc.).
> >>>> This practice is cumbersome for a few reasons:
> >>>>
> >>>> *Any updates to supported content types for a parsing plugin will
> >>>> require a recompilation of the plugin code
> >>>> *Checking for "hard coded" content types within the parsing
> >>>> plugin is a duplication of information that already exists in the
> >>>> plugin's descriptor file, plugin.xml
> >>>> *By the time that content gets to a parsing plugin (e.g., the
> >>>> parsing plugin is returned by the ParserFactory, and provided
> >>>> content during a fetch), the ParserFactory should have already
> >>>> ensured that the appropriate plugin is getting called for a
> >>>> particular contentType.
> >>>>
> >>>> In addition to this problem is the fact that several parsing
> >>>> plugins may all support many of the same content types. For
> >>>> instance, the parse-js plugin may be the only well-suited parsing
> >>>> plugin for JavaScript, but perhaps it may also provide a good enough
> >>>> heuristic parser for plain text as well, and so it may support both
> >>>> types. However, there may be a parsing plugin for text (which there
> >>>> is!), parse-text, whose primary purpose is to parse plain text as
> >>>> well.
> >>>>
> >>>> == Suggested Remedy ==
> >>>> To deal with ensuring the desired parsing plugin is called for the
> >>>> appropriate content type, and to, in effect, "kill two birds with
> >>>> one stone", we propose that there be a parsing plugin preference
> >>>> list for each content type that Nutch knows how to handle, i.e.,
> >>>> each content type available via the mimeType system.
> >>>> Therefore, during a fetch, once the appropriate mimeType has been
> >>>> determined for content, and the ParserFactory is tasked with
> >>>> returning a parsing plugin, the ParserFactory should consult a
> >>>> preference list for that contentType, allowing it to determine
> >>>> which plugin has the highest preference for the contentType. That
> >>>> parsing plugin should be returned via the ParserFactory to the
> >>>> fetcher. If there is any problem using the initially returned
> >>>> parsing plugin for a particular contentType (i.e., if a
> >>>> ParseException is thrown during the parse, or a null ParseStatus
> >>>> is returned), then the ParserFactory should be called again, this
> >>>> time asking for the "next highest ranked" plugin for that
> >>>> contentType. Such a process should repeat until the parse is
> >>>> successful.
> >>>>
> >>>> We propose that the "plugin preference list" should be a separate
> >>>> file that lives in $NUTCH_HOME/conf called "parse-plugins.xml".
> >>>> The format of the file (full DTD to be developed during coding)
> >>>> should be something like: {{{
> >>>>
> >>>> <parse-plugins>
> >>>>   <default pluginname="parse-text"/>
> >>>>   <fileType name="pdf">
> >>>>     <mimeTypes>
> >>>>       <mimeType name="application/pdf" />
> >>>>       <mimeType name="application/x-pdf" />
> >>>>       ...
> >>>>     </mimeTypes>
> >>>>
> >>>>     <plugins>
> >>>>       <plugin name="parse-pdf" order="1"/>
> >>>>       <plugin name="parse-pdf-worse" order="2"/>
> >>>>       ...
> >>>>     </plugins>
> >>>>   </fileType>
> >>>>   ...
> >>>> </parse-plugins>
> >>>>
> >>>> }}}
> >>>>
> >>>> One of the main impacts of having a file like parse-plugins.xml is
> >>>> that the pathSuffix="" attribute should no longer be part of the
> >>>> plugin.xml descriptor.
> >>>> We propose to move that out of plugin.xml and into the
> >>>> mime-types.xml file.
> >>>>
> >>>> == Architectural Impact ==
> >>>>
> >>>> === Components ===
> >>>> *Fetcher
> >>>> *PluginSystem
> >>>> *ParserFactory
> >>>>
> >>>> === Impact on current releases of Nutch ===
> >>>>
> >>>> ''Incompatibilities''
> >>>>
> >>>> Moving the contentType and pathSuffix out of the plugin.xml file
> >>>> would create an updated version of the plugin.xml descriptor
> >>>> schema for each plugin. To lessen the effect on previous and
> >>>> near-term releases of Nutch, this information could be left as an
> >>>> option in the plugin.xml schema, but marked as "deprecated" to
> >>>> let people know that this functionality isn't part of the parse
> >>>> plugin identification process anymore; it is left in the schema
> >>>> so as not to create incompatibilities with the plugin.xml files
> >>>> that people have already written. Ultimately, however, in future
> >>>> releases of Nutch, we propose that the contentType and pathSuffix
> >>>> attributes should be removed from the plugin.xml schema.
> >>>>
> >>>> Other than the plugin.xml file schema change, this capability
> >>>> addition will simply control the order in which parsing plugins get
> >>>> called during fetching activities. It won't directly impact the
> >>>> segments stored, or the webapp, or any of the main components of
> >>>> Nutch.
> >>>>
> >>>> ''Issues''
> >>>>
> >>>> The proposed new capabilities should first be tested on local
> >>>> systems, and if successful, uploaded to JIRA, and verified against
> >>>> the latest SVN revision.
> >>>> Unit tests should be written to verify appropriate plugin parsing
> >>>> order.
> >>>> Users will need to be notified in the Nutch tutorial and
> >>>> instruction lists about how to set up the parsing plugin
> >>>> preferences prior to performing a fetch.
> >>>>
> >>>> == Personnel ==
> >>>>
> >>>> *Jerome Charron
> >>>> *Sébastien Le Callonnec
> >>>> *Chris A. Mattmann
> >>>>
> >>>> == Timeframe ==
> >>>>
> >>>
> >> === message truncated ===
> >>
> >
>
> ______________________________________________
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop: 171-246
> Phone: 818-354-8810
> _______________________________________________________
>
> Disclaimer: The opinions presented within are my own and do not
> reflect those of either NASA, JPL, or the California Institute of
> Technology.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
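
The two ideas settled in this thread (plugin preference taken from the order of the `<plugin>` elements, with `order="N"` still accepted, and a fallback to the next plugin when a parse fails) can be sketched roughly as below. This is only an illustration under stated assumptions, not Nutch code: it is written in Python for brevity although Nutch itself is Java, and `plugin_preferences`, `parse_with_fallback`, `ParseError` and `CONFIG` are hypothetical names. The config reuses the element and plugin names from Otis's example. The sketch relies on the fact that Python's `xml.etree.ElementTree` returns child elements in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical config in the order-free style suggested in the thread.
CONFIG = """
<parse-plugins>
  <mimeType name="*">
    <plugin id="parse-text"/>
    <plugin id="another-one-default-parser"/>
  </mimeType>
</parse-plugins>
"""


class ParseError(Exception):
    """Stand-in for the ParseException mentioned in the proposal."""


def plugin_preferences(xml_text, mime_type):
    """Return the plugin ids for mime_type, most preferred first."""
    root = ET.fromstring(xml_text)
    for mime in root.iter("mimeType"):
        if mime.get("name") != mime_type:
            continue
        # findall() yields the <plugin> children in document order.
        plugins = mime.findall("plugin")
        if all(p.get("order") is not None for p in plugins):
            # Assumption: honour order="N" only when every plugin has it.
            # sorted() is stable, so equal values keep document order.
            plugins = sorted(plugins, key=lambda p: int(p.get("order")))
        return [p.get("id") for p in plugins]
    return []


def parse_with_fallback(content, plugin_ids, parsers):
    """Try each parser in preference order until one succeeds.

    parsers maps a plugin id to a callable; a ParseError or a None
    result counts as a failure and moves on to the next plugin.
    """
    for plugin_id in plugin_ids:
        parser = parsers.get(plugin_id)
        if parser is None:
            continue
        try:
            result = parser(content)
        except ParseError:
            continue
        if result is not None:
            return plugin_id, result
    # Every candidate failed; the proposal does not say what happens here.
    return None, None


if __name__ == "__main__":
    print(plugin_preferences(CONFIG, "*"))
```

Treating equal `order` values as "keep document order" is one possible answer to the duplicate-order case raised in the proposal; the thread itself does not settle that point.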
