[Nutch Wiki] Trivial Update of PluginCentral by AlexMc
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PluginCentral page has been changed by AlexMc. The comment on this change is: typo.
http://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=61&rev2=62

--------------------------------------------------

  * [[WritingPluginExample-0.9]] - Step-by-step example of how to write a plugin for the current development.
  * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
  * [[http://wiki.media-style.com/display/nutchDocu/Write+a+plugin|Writing Plugins]] - by Stefan
- * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin] by Sujitpal
+ * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin]] by Sujitpal
  * [[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a plugin to add dates]] by Ryan Pfister

== Plugins that Come with Nutch (0.9) ==
[jira] Created: (NUTCH-843) Separate the build and runtime environments
Separate the build and runtime environments
-------------------------------------------

                 Key: NUTCH-843
                 URL: https://issues.apache.org/jira/browse/NUTCH-843
             Project: Nutch
          Issue Type: Improvement
          Components: build
    Affects Versions: 2.0
            Reporter: Andrzej Bialecki
            Assignee: Andrzej Bialecki

Currently there is no clean separation of source, build and runtime artifacts. On one hand, this makes it easier to get started in local mode, but on the other hand it makes the distributed (or pseudo-distributed) setup much more challenging and tricky. Also, some resources (config files and classes) are included several times on the classpath, they are loaded under different classloaders, and in the end it's not obvious which copy takes precedence, and why.

Here's an example of harmful unintended behavior caused by this mess: Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on their classpath. This means that a task running on this cluster will have two copies of resources from these locations - one from the classpath inherited from the tasktracker, and the other from the just-unpacked nutch.job file. If the two versions differ, only the first one will be loaded - in this case the one taken from the (unpacked) conf/ and build/ - and the other one, from within the nutch.job file, will be ignored. It's even worse when you add more nodes to the cluster: the nutch.job will be shipped to the new nodes as part of each task setup, but now the remote tasktracker child processes will use resources from nutch.job - so some tasks will use different versions of resources than others. This usually leads to a host of very-difficult-to-debug issues.

This issue therefore proposes to separate these environments into the following areas:

* source area - i.e. our current sources. Note that bin/ scripts will belong to this category too, so there will be no top-level bin/. nutch-default.xml belongs to this category too. Other customizable files can be moved to src/conf too, or they could stay in top-level conf/ as today, with a README that explains that changes made there take effect only after you rebuild the job jar.
* build area - contains build artifacts, among them the nutch.job jar.
* runtime (or deploy) area - contains all artifacts needed to run Nutch jobs. For a distributed setup that uses an existing Hadoop cluster (installed from a plain vanilla Hadoop release) this will be a {{/deploy}} directory, where we put the following:

{code}
bin/nutch
nutch.job
{code}

That's it - nothing else should be needed, because all other resources are already included in the job jar. These resources can be copied directly to the master Hadoop node. For a local setup (using LocalJobTracker) this will be a {{/runtime}} directory, where we put the following:

{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}

Due to limitations in the PluginClassLoader, the local runtime requires that the plugins/ directory be unpacked from the job jar. And we need the Hadoop libs to run in local mode. We may later refine this local setup to something like this:

{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}

so that it's easier to modify the config without rebuilding the job jar (which actually would not be used in this case).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
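The resource-shadowing behavior described above can be observed directly from Java. A minimal sketch (the resource name nutch-site.xml is just an illustrative example, not part of the patch): ClassLoader.getResources() enumerates every copy of a resource on the classpath, while getResource() returns only the first match in classpath order - the copy that silently wins over the one inside nutch.job.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/**
 * Sketch: detect duplicate copies of a resource on the classpath and see
 * which one the classloader will actually hand out. Class and method names
 * are invented for this example.
 */
public class ClasspathCheck {

    /** Returns every copy of the named resource visible to the given loader. */
    static List<URL> allCopies(ClassLoader loader, String name) throws IOException {
        return Collections.list(loader.getResources(name));
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        String name = "nutch-site.xml";  // hypothetical resource name

        List<URL> copies = allCopies(loader, name);
        System.out.println(copies.size() + " copies of " + name + " on the classpath");

        // getResource() returns the FIRST match in classpath order - this is
        // the copy that "wins", e.g. the one from an unpacked conf/ rather
        // than the one inside the nutch.job jar.
        System.out.println("winner: " + loader.getResource(name));
    }
}
```

If this prints more than one copy on a tasktracker node, the job is in exactly the situation the issue describes.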
[jira] Commented: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885967#action_12885967 ]

Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------

Super +1. I've wanted to do something like this for a looong time: http://markmail.org/thread/osmfz6pknr4n4unf ;) Let me think about the deployment structure a little bit and comment back on this issue...
[jira] Updated: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-843:
-----------------------------------

    Attachment: NUTCH-843.patch

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and /runtime/local areas, populated with the right pieces. bin/nutch has been modified to work correctly in both cases.
Re: Parse-tika ignores too much data...
Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

- frame
- iframe
- img
- map
- object
- link (only in head section)

I worked around this in my crawling code by directly processing the DOM, but I should roll this into Tika. If you have a list of problems with test docs, file a TIKA issue and I'll try to fix things up quickly.

Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:

> Hi,
>
> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser. The results are not so good for some test cases... Even when using IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset), and for some others (area) it drops the href. As a result, the number of valid outlinks collected with parse-tika is much smaller than with parse-html.
>
> I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a partial fix was applied to Tika 0.8, but that still won't handle the problems I mentioned above. Can we come up with a plan to address this? I'd rather switch completely to Tika's HTML parsing, but at the moment we would lose too much useful data...
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web; Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
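The DOM workaround Ken describes can be sketched roughly as follows. This is a hypothetical illustration, not his actual code: walk the parsed document and pull link targets out of the elements Tika was missing (frame, iframe, img, area, link, object). The element/attribute pairs come from the HTML spec; class and method names are invented here. It uses the JDK XML parser on well-formed XHTML for simplicity - a real crawler would use a lenient HTML parser, since real pages are rarely well-formed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch of DOM-based outlink extraction for elements Tika was dropping. */
public class OutlinkExtractor {

    // element name -> attribute that carries the link target
    private static final String[][] LINK_ATTRS = {
        {"frame", "src"}, {"iframe", "src"}, {"img", "src"},
        {"area", "href"}, {"link", "href"}, {"object", "data"},
    };

    public static List<String> extractOutlinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (String[] pair : LINK_ATTRS) {
            NodeList nodes = doc.getElementsByTagName(pair[0]);
            for (int i = 0; i < nodes.getLength(); i++) {
                String target = ((Element) nodes.item(i)).getAttribute(pair[1]);
                if (!target.isEmpty()) {
                    links.add(target);
                }
            }
        }
        return links;
    }

    public static List<String> extractOutlinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        return extractOutlinks(doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><frameset><frame src=\"menu.html\"/>"
            + "<frame src=\"content.html\"/></frameset></html>";
        // Frame targets survive even though there is no <body>.
        System.out.println(extractOutlinks(page)); // prints [menu.html, content.html]
    }
}
```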
[jira] Commented: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886012#action_12886012 ]

Chris A. Mattmann commented on NUTCH-843:
-----------------------------------------

Hey Andrzej:

Wouldn't my proposed deployment structure in theory be equivalent to, say, creating a .job file as you proposed above? You can think of the proposed dir structure as an exploded version of the unpacked .job.

Cheers,
Chris
[jira] Commented: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886015#action_12886015 ]

Andrzej Bialecki commented on NUTCH-843:
----------------------------------------

We need to create the job file anyway. Actually, the patch I attached does something like this for the local setup (lib/ is flattened), but I would still argue for setting up two areas, /runtime/deploy and /runtime/local - it's then painfully obvious what parts you need to deploy to a Hadoop cluster.
[jira] Updated: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-843:
-----------------------------------

    Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.
Re: Parse-tika ignores too much data...
Hi Julien,

> See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way body is handled; we also saw cases where it appeared twice in the output.

Don't know about the case of it appearing twice. But for the above issue, I added a comment. The test HTML is badly broken, in that you can have either a body OR a frameset, but not both.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: Parse-tika ignores too much data...
Hi Ken,

Thank you for your comments and analysis. We should probably modify the HTMLHandler so that it does not discard a frameset just because the body level is equal to 0. I suggested earlier on the Tika list having a mechanism for specifying a custom handler via the context; that would give us the option in Nutch to implement the logic we want, i.e. ignore the body level if we want to.

Thanks,
J.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
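The custom handler Julien proposes could behave along these lines. A hedged sketch, not Tika's actual API or the eventual Nutch implementation: a SAX ContentHandler that records outlinks as elements stream past, with no body-level bookkeeping, so a frame inside a frameset is never discarded. It runs on well-formed XHTML with the JDK SAX parser purely for illustration; in Tika the handler would be supplied through the parse context (the same mechanism by which IdentityHtmlMapper, mentioned above, is selected).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: collect link targets with no body-level filtering. */
public class LinkCollectingHandler extends DefaultHandler {

    final List<String> outlinks = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // No bodyLevel check here: content inside <frameset> is kept too.
        String target = atts.getValue("src");
        if (target == null) target = atts.getValue("href");
        if (target != null) outlinks.add(target);
    }

    public static List<String> collect(String xhtml) throws Exception {
        LinkCollectingHandler handler = new LinkCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.outlinks;
    }

    public static void main(String[] args) throws Exception {
        // A frameset-only page: the frame's src is still collected.
        System.out.println(collect(
            "<html><frameset><frame src=\"left.html\"/></frameset></html>"));
    }
}
```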
Re: Classifying pages on Nutch: plugins?
Julien,

I'm in Luan's project too. I'd like to know if you have examples of the API use, or documentation. I've seen the PDF at DigitalPebble's site but couldn't work out how to use it. Also, after downloading the project from Google Code's SVN, I saw the JUnit tests, but the main test (for me, classifyTest) needs 2 files as input, so I'm puzzled. The libsvm file is just one, isn't it? Which files should I use as input to fileSubj and fileObj?

Thanks,
Daniel Gimenes

--
View this message in context: http://lucene.472066.n3.nabble.com/Classifying-pages-on-Nutch-plugins-tp946215p950512.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.