[Nutch Wiki] Trivial Update of PluginCentral by AlexMc

2010-07-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PluginCentral page has been changed by AlexMc.
The comment on this change is: typo.
http://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=61&rev2=62

--

   * [[WritingPluginExample-0.9]] - Step-by-step example of how to write a plugin for the current development branch.
   * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
   * [[http://wiki.media-style.com/display/nutchDocu/Write+a+plugin|Writing Plugins]] - by Stefan
-  * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin] by Sujitpal
+  * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin]] by Sujitpal
   * [[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a plugin to add dates]] by Ryan Pfister
  
  == Plugins that Come with Nutch (0.9) ==


[jira] Created: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
Separate the build and runtime environments
---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Currently there is no clean separation of source, build and runtime artifacts. 
On one hand, this makes it easier to get started in local mode, but on the 
other hand it makes the distributed (or pseudo-distributed) setup much more 
challenging and tricky. Also, some resources (config files and classes) are 
included several times on the classpath, they are loaded under different 
classloaders, and in the end it's not obvious which copy takes precedence, or 
why.

Here's an example of a harmful unintended behavior caused by this mess: Hadoop 
daemons (jobtracker and tasktracker) will get conf/ and build/ on their 
classpath. This means that a task running on this cluster will have two copies 
of resources from these locations - one from the inherited classpath from 
tasktracker, and the other one from the just unpacked nutch.job file. If these 
two versions differ, only the first one will be loaded, which in this case is 
the one taken from the (unpacked) conf/ and build/ - the other one, from within 
the nutch.job file, will be ignored.

It's even worse when you add more nodes to the cluster - the nutch.job will be 
shipped to the new nodes as a part of each task setup, but now the remote 
tasktracker child processes will use resources from nutch.job - so some tasks 
will use different versions of resources than other tasks. This usually leads 
to a host of very difficult to debug issues.
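
(As an aside, the precedence is easy to observe with a few lines of plain Java 
- a sketch, not Nutch code: getResource() returns only the winning copy, while 
getResources() reveals every shadowed copy on the classpath.)

{code}
// Plain-Java sketch (not Nutch code): getResource() silently returns the
// first copy found on the classpath; getResources() lists every copy.
public class ClasspathAudit {
  public static void main(String[] args) throws Exception {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    String name = "nutch-default.xml";
    // the copy that actually takes precedence:
    System.out.println("wins:  " + cl.getResource(name));
    // every copy present, in classpath order:
    java.util.Enumeration<java.net.URL> all = cl.getResources(name);
    while (all.hasMoreElements()) {
      System.out.println("found: " + all.nextElement());
    }
  }
}
{code}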

This issue proposes to separate these environments into the following areas:

* source area - i.e. our current sources. Note that bin/ scripts belong to 
this category, so there will be no top-level bin/; nutch-default.xml belongs 
here too. Other customizable files can be moved to src/conf as well, or they 
could stay in top-level conf/ as today, with a README that explains that 
changes made there take effect only after you rebuild the job jar.

* build area - contains build artifacts, among them the nutch.job jar.

* runtime (or deploy) area - this area contains all artifacts needed to run 
Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
(installed from a plain vanilla Hadoop release), this will be a {{/deploy}} 
directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are 
already included in the job jar. These resources can be copied directly to the 
master Hadoop node.

For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader, the local runtime requires that 
the plugins/ directory be unpacked from the job jar, and we need the Hadoop 
libs to run in local mode. We may later refine this local setup to something 
like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which 
actually would not be used in this case).
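
For context, the PluginClassLoader limitation boils down to this: it is a 
URLClassLoader, and a URLClassLoader can load classes from plain directories 
and jars on disk, but not from jars nested inside an unexploded job jar. A 
rough sketch of the idea (the plugin name and layout here are illustrative 
only):

{code}
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch only: load a plugin from an unpacked plugins/ directory. This is
// why the local runtime needs plugins/ as plain files on disk.
public class PluginDirLoader {
  public static ClassLoader forPlugin(File pluginDir) throws Exception {
    // e.g. runtime/local/plugins/parse-html/ (hypothetical layout)
    URL classes = new File(pluginDir, "classes/").toURI().toURL();
    URL jar = new File(pluginDir, "parse-html.jar").toURI().toURL();
    return new URLClassLoader(new URL[] { classes, jar },
        PluginDirLoader.class.getClassLoader());
  }
}
{code}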

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885967#action_12885967
 ] 

Chris A. Mattmann commented on NUTCH-843:
-

Super +1 

I've wanted to do something like this for a looong time 
http://markmail.org/thread/osmfz6pknr4n4unf

;)

Let me think about the deployment structure a little bit and comment back on 
this issue...

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and 
/runtime/local areas, populated with the right pieces. bin/nutch has been 
modified to work correctly in both cases.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

 - frame
 - iframe
 - img
 - map
 - object
 - link (only in head section)

I worked around this in my crawling code by directly processing the DOM, but I 
should roll this into Tika.
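
Roughly, the workaround looks like this - a simplified sketch rather than my 
actual crawling code, and it assumes the DOM has already been built by the 
HTML parser:

{code}
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Simplified sketch: pull link targets straight out of the DOM for the
// elements listed above.
public class ExtraOutlinks {
  private static final String[][] TAG_ATTR = {
    { "frame", "src" }, { "iframe", "src" }, { "img", "src" },
    { "area", "href" },   // <area> inside <map>
    { "object", "data" }, { "link", "href" }
  };

  public static List<String> extract(Document doc) {
    List<String> links = new ArrayList<String>();
    for (String[] ta : TAG_ATTR) {
      NodeList nodes = doc.getElementsByTagName(ta[0]);
      for (int i = 0; i < nodes.getLength(); i++) {
        String value = ((Element) nodes.item(i)).getAttribute(ta[1]);
        if (value.length() > 0) links.add(value);
      }
    }
    return links;
  }
}
{code}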


If you have a list of problems with test docs, file a TIKA issue and  
I'll try to fix things up quickly.


Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:


Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare 
the test DOMs with Tika's HtmlParser.


Results are not so good for some test cases... Even when using 
IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset), and 
for some others (area) it drops the href. As a result, the number of valid 
outlinks collected with parse-tika is much smaller than with parse-html.
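
For reference, the test setup is essentially this (a minimal sketch, assuming 
Tika 0.8's IdentityHtmlMapper and ToHTMLContentHandler; the handler just 
re-serializes whatever Tika keeps, so dropped elements are easy to spot):

{code}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToHTMLContentHandler;

// Minimal sketch: IdentityHtmlMapper goes in via the ParseContext so Tika
// should keep elements it would normally map away; the handler then
// re-serializes whatever survived the parse.
public class TikaDomCheck {
  public static String parse(InputStream html) throws Exception {
    ToHTMLContentHandler handler = new ToHTMLContentHandler();
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    new HtmlParser().parse(html, handler, new Metadata(), context);
    return handler.toString();
  }
}
{code}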


I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a 
partial fix was applied to Tika 0.8, but that still won't handle the problems 
I mentioned above.


Can we come up with a plan to address this? I'd rather switch completely to 
Tika's HTML parsing, but at the moment we would lose too much useful data...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886012#action_12886012
 ] 

Chris A. Mattmann commented on NUTCH-843:
-

Hey Andrzej:

Wouldn't my proposed deployment structure in theory be equivalent to, say, 
creating a .job file as you proposed above? You can think of the proposed dir 
structure as an exploded version of the unpacked .job.

Cheers,
Chris


 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886015#action_12886015
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

We need to create the job file anyway. Actually, the patch I attached does 
something like this for the local setup (lib/ is flattened), but I would still 
argue for setting up two areas, /runtime/deploy and /runtime/local - it's 
painfully obvious then which parts you need to deploy to a Hadoop cluster.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch, NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of 
the cases found by Andrzej. There seems to be something very wrong with the 
way body is handled; we also saw cases where it appeared twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly  
broken, in that you can either have a body OR a frameset, but not  
both.


-- Ken



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Hi Ken,

Thank you for your comments and analysis. We should probably modify the 
HtmlHandler so that it does not discard a frameset just because the body level 
is 0. I suggested earlier on the Tika list adding a mechanism for specifying a 
custom handler via the Context; that would give us the option in Nutch to 
implement the logic we want, i.e. ignore the body level if we choose to.
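
To illustrate, the handler plugged in via the Context could be as simple as 
this sketch (a hypothetical class, just to show collecting frame targets 
regardless of the body level):

{code}
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: record frame targets in the SAX callbacks
// themselves, so nothing is lost even when the body-level bookkeeping
// would have dropped the frameset.
public class FramesetAwareHandler extends DefaultHandler {
  private final List<String> outlinks = new ArrayList<String>();

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    // cope with both namespace-aware and plain SAX events
    String tag = localName.length() > 0 ? localName : qName;
    if ("frame".equalsIgnoreCase(tag) || "iframe".equalsIgnoreCase(tag)) {
      String src = atts.getValue("src");
      if (src != null) outlinks.add(src);
    }
  }

  public List<String> getOutlinks() { return outlinks; }
}
{code}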

Thanks

J.



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


Re: Classifying pages on Nutch: plugins?

2010-07-07 Thread dgimenes

Julien,

I'm in Luan's project too.

I'd like to know if you have examples of how to use the API, or documentation. 
I've seen the PDF at DigitalPebble's site but couldn't work out how to use it.

Also, after downloading the project from Google Code's SVN, I saw the JUnit 
tests, but the main test (for me, classifyTest) needs 2 files as input, so I'm 
puzzled. There is just one libsvm file, isn't there? Which files should I use 
as input to fileSubj and fileObj?

Thanks.
Daniel Gimenes
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Classifying-pages-on-Nutch-plugins-tp946215p950512.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.