[Nutch Wiki] Trivial Update of PluginCentral by AlexMc

2010-07-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PluginCentral page has been changed by AlexMc.
The comment on this change is: typo.
http://wiki.apache.org/nutch/PluginCentral?action=diff&rev1=61&rev2=62

--

   * [[WritingPluginExample-0.9]] - Step-by-step example of how to write a plugin for the current development branch.
   * WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
   * [[http://wiki.media-style.com/display/nutchDocu/Write+a+plugin|Writing Plugins]] - by Stefan
-  * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin] by Sujitpal
+  * [[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html|Example of writing a custom plugin]] by Sujitpal
   * [[http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/|Writing a plugin to add dates]] by Ryan Pfister
  
  == Plugins that Come with Nutch (0.9) ==


[jira] Created: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
Separate the build and runtime environments
---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Currently there is no clean separation of source, build and runtime artifacts. 
On one hand, this makes it easier to get started in local mode, but on the 
other hand it makes the distributed (or pseudo-distributed) setup much more 
challenging and tricky. Also, some resources (config files and classes) are 
included several times on the classpath, they are loaded under different 
classloaders, and in the end it's not obvious which copy takes precedence, or 
why.

Here's an example of a harmful unintended behavior caused by this mess: Hadoop 
daemons (jobtracker and tasktracker) will get conf/ and build/ on their 
classpath. This means that a task running on this cluster will have two copies 
of resources from these locations - one from the inherited classpath from 
tasktracker, and the other one from the just unpacked nutch.job file. If these 
two versions differ, only the first one will be loaded, which in this case is 
the one taken from the (unpacked) conf/ and build/ - the other one, from within 
the nutch.job file, will be ignored.

It's even worse when you add more nodes to the cluster - the nutch.job will be 
shipped to the new nodes as a part of each task setup, but now the remote 
tasktracker child processes will use resources from nutch.job - so some tasks 
will use different versions of resources than other tasks. This usually leads 
to a host of very difficult to debug issues.
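
(As an aside, the precedence is easy to observe with a few lines of plain Java 
- a sketch, not Nutch code: getResource() returns only the winning copy, while 
getResources() reveals every shadowed copy on the classpath.)

{code}
// Plain-Java sketch (not Nutch code): getResource() silently returns the
// first copy found on the classpath; getResources() lists every copy.
public class ClasspathAudit {
  public static void main(String[] args) throws Exception {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    String name = "nutch-default.xml";
    // the copy that actually takes precedence:
    System.out.println("wins:  " + cl.getResource(name));
    // every copy present, in classpath order:
    java.util.Enumeration<java.net.URL> all = cl.getResources(name);
    while (all.hasMoreElements()) {
      System.out.println("found: " + all.nextElement());
    }
  }
}
{code}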

This issue proposes to separate these environments into the following areas:

* source area - i.e. our current sources. Note that bin/ scripts belong to 
this category, so there will be no top-level bin/; nutch-default.xml belongs 
here too. Other customizable files can be moved to src/conf as well, or they 
could stay in top-level conf/ as today, with a README that explains that 
changes made there take effect only after you rebuild the job jar.

* build area - contains build artifacts, among them the nutch.job jar.

* runtime (or deploy) area - this area contains all artifacts needed to run 
Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
(installed from a plain vanilla Hadoop release), this will be a {{/deploy}} 
directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are 
already included in the job jar. These resources can be copied directly to the 
master Hadoop node.

For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader, the local runtime requires that 
the plugins/ directory be unpacked from the job jar, and we need the Hadoop 
libs to run in local mode. We may later refine this local setup to something 
like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which 
actually would not be used in this case).
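
For context, the PluginClassLoader limitation boils down to this: it is a 
URLClassLoader, and a URLClassLoader can load classes from plain directories 
and jars on disk, but not from jars nested inside an unexploded job jar. A 
rough sketch of the idea (the plugin name and layout here are illustrative 
only):

{code}
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch only: load a plugin from an unpacked plugins/ directory. This is
// why the local runtime needs plugins/ as plain files on disk.
public class PluginDirLoader {
  public static ClassLoader forPlugin(File pluginDir) throws Exception {
    // e.g. runtime/local/plugins/parse-html/ (hypothetical layout)
    URL classes = new File(pluginDir, "classes/").toURI().toURL();
    URL jar = new File(pluginDir, "parse-html.jar").toURI().toURL();
    return new URLClassLoader(new URL[] { classes, jar },
        PluginDirLoader.class.getClassLoader());
  }
}
{code}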

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885967#action_12885967
 ] 

Chris A. Mattmann commented on NUTCH-843:
-

Super +1 

I've wanted to do something like this for a looong time 
http://markmail.org/thread/osmfz6pknr4n4unf

;)

Let me think about the deployment structure a little bit and comment back on 
this issue...

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and 
/runtime/local areas, populated with the right pieces. bin/nutch has been 
modified to work correctly in both cases.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

 - frame
 - iframe
 - img
 - map
 - object
 - link (only in head section)

I worked around this in my crawling code by directly processing the DOM, but I 
should roll this into Tika.
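
Roughly, the workaround looks like this - a simplified sketch rather than my 
actual crawling code, and it assumes the DOM has already been built by the 
HTML parser:

{code}
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Simplified sketch: pull link targets straight out of the DOM for the
// elements listed above.
public class ExtraOutlinks {
  private static final String[][] TAG_ATTR = {
    { "frame", "src" }, { "iframe", "src" }, { "img", "src" },
    { "area", "href" },   // <area> inside <map>
    { "object", "data" }, { "link", "href" }
  };

  public static List<String> extract(Document doc) {
    List<String> links = new ArrayList<String>();
    for (String[] ta : TAG_ATTR) {
      NodeList nodes = doc.getElementsByTagName(ta[0]);
      for (int i = 0; i < nodes.getLength(); i++) {
        String value = ((Element) nodes.item(i)).getAttribute(ta[1]);
        if (value.length() > 0) links.add(value);
      }
    }
    return links;
  }
}
{code}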


If you have a list of problems with test docs, file a TIKA issue and  
I'll try to fix things up quickly.


Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:


Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare 
the test DOMs with Tika's HtmlParser.


Results are not so good for some test cases... Even when using 
IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset), and 
for some others (area) it drops the href. As a result, the number of valid 
outlinks collected with parse-tika is much smaller than with parse-html.
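
For reference, the test setup is essentially this (a minimal sketch, assuming 
Tika 0.8's IdentityHtmlMapper and ToHTMLContentHandler; the handler just 
re-serializes whatever Tika keeps, so dropped elements are easy to spot):

{code}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToHTMLContentHandler;

// Minimal sketch: IdentityHtmlMapper goes in via the ParseContext so Tika
// should keep elements it would normally map away; the handler then
// re-serializes whatever survived the parse.
public class TikaDomCheck {
  public static String parse(InputStream html) throws Exception {
    ToHTMLContentHandler handler = new ToHTMLContentHandler();
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    new HtmlParser().parse(html, handler, new Metadata(), context);
    return handler.toString();
  }
}
{code}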


I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a 
partial fix was applied to Tika 0.8, but that still won't handle the problems 
I mentioned above.


Can we come up with a plan to address this? I'd rather switch completely to 
Tika's HTML parsing, but at the moment we would lose too much useful data...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886012#action_12886012
 ] 

Chris A. Mattmann commented on NUTCH-843:
-

Hey Andrzej:

Wouldn't my proposed deployment structure in theory be equivalent to, say, 
creating a .job file as you proposed above? You can think of the proposed dir 
structure as an exploded version of the unpacked .job.

Cheers,
Chris


 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886015#action_12886015
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

We need to create the job file anyway. Actually, the patch I attached does 
something like this for the local setup (lib/ is flattened), but I would still 
argue for setting up two areas, /runtime/deploy and /runtime/local - it's 
painfully obvious then which parts you need to deploy to a Hadoop cluster.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch, NUTCH-843.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of 
the cases found by Andrzej. There seems to be something very wrong with the 
way body is handled; we also saw cases where it appeared twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly  
broken, in that you can either have a body OR a frameset, but not  
both.


-- Ken



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Hi Ken,

Thank you for your comments and analysis. We should probably modify the 
HtmlHandler so that it does not discard a frameset just because the body level 
is 0. I suggested earlier on the Tika list adding a mechanism for specifying a 
custom handler via the Context; that would give us the option in Nutch to 
implement the logic we want, i.e. ignore the body level if we choose to.
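
To illustrate, the handler plugged in via the Context could be as simple as 
this sketch (a hypothetical class, just to show collecting frame targets 
regardless of the body level):

{code}
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: record frame targets in the SAX callbacks
// themselves, so nothing is lost even when the body-level bookkeeping
// would have dropped the frameset.
public class FramesetAwareHandler extends DefaultHandler {
  private final List<String> outlinks = new ArrayList<String>();

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    // cope with both namespace-aware and plain SAX events
    String tag = localName.length() > 0 ? localName : qName;
    if ("frame".equalsIgnoreCase(tag) || "iframe".equalsIgnoreCase(tag)) {
      String src = atts.getValue("src");
      if (src != null) outlinks.add(src);
    }
  }

  public List<String> getOutlinks() { return outlinks; }
}
{code}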

Thanks

J.



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


Re: Classifying pages on Nutch: plugins?

2010-07-07 Thread dgimenes

Julien,

I'm in Luan's project too.

I'd like to know if you have examples of how to use the API, or documentation. 
I've seen the PDF at DigitalPebble's site but couldn't work out how to use it.

Also, after downloading the project from Google Code's SVN, I saw the JUnit 
tests, but the main test (for me, classifyTest) needs 2 files as input, so I'm 
puzzled. There is just one libsvm file, isn't there? Which files should I use 
as input to fileSubj and fileObj?

Thanks.
Daniel Gimenes
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Classifying-pages-on-Nutch-plugins-tp946215p950512.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.