Hi Andrzej,

Regarding the second question: I don't think it is a matter of the "content size limit". In our CMS product, we need to index only the content that starts at "<!--Indexware Content Starts Here-->" and ends at "<!--Indexware Content Ends Here-->". It is easy to change the HtmlParser ....
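
A minimal sketch of that kind of marker-based extraction, assuming the page is already available as a String. The class and method names are hypothetical; this is not the actual HtmlParser change and it uses no Nutch API.

```java
import java.util.Optional;

/**
 * Hypothetical helper, not part of Nutch: returns the text between the two
 * marker comments so that only that region is handed to the indexer.
 */
final class MarkedContentExtractor {
    static final String START = "<!--Indexware Content Starts Here-->";
    static final String END = "<!--Indexware Content Ends Here-->";

    private MarkedContentExtractor() {}

    /**
     * Returns the region between the first START marker and the next END
     * marker, or an empty Optional if either marker is missing (for example
     * when the fetched content was truncated before the END marker).
     */
    static Optional<String> extract(String html) {
        int start = html.indexOf(START);
        if (start < 0) {
            return Optional.empty();
        }
        int from = start + START.length();
        int end = html.indexOf(END, from);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(from, end));
    }

    public static void main(String[] args) {
        String page = "<html><body><div>menu</div>"
                + START + "<p>Article text</p>" + END
                + "<div>footer</div></body></html>";
        // Fall back to the whole page when the markers are not both present.
        System.out.println(extract(page).orElse(page));
    }
}
```

Returning an empty Optional leaves the fallback to the caller: it can index the whole page, as in main above, or skip the document.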
/Jack

On 4/17/05, Andrzej Bialecki (JIRA) <[EMAIL PROTECTED]> wrote:
>      [ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_62996 ]
>
> Andrzej Bialecki commented on NUTCH-34:
> ----------------------------------------
>
> Currently there is such a "registry", and it is built and maintained by
> PluginRepository.
>
> So, it seems to me that the only change required here would be to add
> attributes to each plugin config file (and plugin interface) which inform all
> plugin users about the following:
>
> * a boolean, whether the plugin can handle incomplete files or not.
>
> * an int, setting the content size limit.
>
> > Parsing different content formats
> > ---------------------------------
> >
> >          Key: NUTCH-34
> >          URL: http://issues.apache.org/jira/browse/NUTCH-34
> >      Project: Nutch
> >         Type: Improvement
> >   Components: fetcher
> >     Reporter: Stephan Strittmatter
> >     Priority: Trivial
> >
> > At the moment Nutch is set up to filter content by config the xml-config
> > file.
> > There it is also set global how many bytes are loaded.
> > I think it would be better to let the parser plugins "register" themselves
> > in some registry where every plugin could tell the fetcher, that:
> > 1. this document type is wanted (because the parser plugin is
> >    installed and activated)
> > 2. how much of the content is required (some plugins need the whole
> >    content and some not)
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>    http://issues.apache.org/jira/secure/Administrators.jspa
> -
> If you want more information on JIRA, or have a bug to report see:
>    http://www.atlassian.com/software/jira
>
