RE: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-17 Thread Gal Nitzan
Thanks Doğacan, much obliged.

Gal.

 -Original Message-
 From: Doğacan Güney (JIRA) [mailto:[EMAIL PROTECTED]
 Sent: Sunday, June 17, 2007 11:29 PM
 To: nutch-dev@lucene.apache.org
 Subject: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return
 ParseResult object instead of Parse object


  [ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

 Doğacan Güney resolved NUTCH-485.
 -

 Resolution: Fixed

 Committed in rev 548103 with two modifications:

 1) Fix whitespace issues.

 2) Original patch changed CCParseFilter to return the original parse
 result if CCParseFilter fails. Now if CCParseFilter fails with an
 exception, it returns an empty parse created from the exception.
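
A generic sketch of that failure-handling pattern, in plain Java rather than the
actual Nutch classes (all names below are illustrative, not the committed code):

  // Illustrative only: if the filter throws, return an "empty" result built from
  // the exception instead of silently passing the original parse through.
  import java.util.Collections;
  import java.util.List;

  class FailSafeFilterSketch {
    static List<String> filter(List<String> parse) {
      try {
        return doCreativeCommonsFilter(parse);   // stand-in for the real filtering
      } catch (Exception e) {
        // "an empty parse created from the exception"
        return Collections.singletonList("parse failed: " + e);
      }
    }
    static List<String> doCreativeCommonsFilter(List<String> parse) throws Exception {
      return parse;                               // real logic omitted
    }
  }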

  Change HtmlParseFilter 's to return ParseResult object instead of Parse
 object
  
 --
 
  Key: NUTCH-485
  URL: https://issues.apache.org/jira/browse/NUTCH-485
  Project: Nutch
   Issue Type: Improvement
   Components: fetcher
 Affects Versions: 1.0.0
  Environment: All
 Reporter: Gal Nitzan
 Assignee: Doğacan Güney
  Fix For: 1.0.0
 
  Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch,
 NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch,
 NUTCH-485.200705140001.patch
 
 
  The current implementation of HtmlParseFilters.java doesn't allow a
 filter to add parse objects to the ParseResult object.
  A change to the HtmlParseFilter is needed which allows the filter to
 return a ParseResult, and of course a corresponding change to HtmlParseFilters.
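
 The reworked extension point looks roughly like the sketch below (reconstructed
 from memory of the post-NUTCH-485 sources; details such as the exact parameter
 list may differ):

   // Approximate shape of the changed extension point: a filter now receives the
   // ParseResult built so far and returns a ParseResult, so it can add extra
   // Parse entries (e.g. one per discovered sub-document) instead of a single Parse.
   import org.apache.nutch.parse.HTMLMetaTags;
   import org.apache.nutch.parse.ParseResult;
   import org.apache.nutch.protocol.Content;
   import org.w3c.dom.DocumentFragment;

   public interface HtmlParseFilter {
     ParseResult filter(Content content, ParseResult parseResult,
                        HTMLMetaTags metaTags, DocumentFragment doc);
   }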

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




RE: Lock file problems...

2007-06-07 Thread Gal Nitzan
I index directly to Solr.
It happened to me while two separate indexers accessed it directly. The Lucene
index seemed to stay locked (that's why the lock file is still there) until I
killed the process. After that I had to rebuild the index, since I was afraid it
had become corrupted.
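
A minimal sketch (not from the Nutch code base) for clearing a stale Lucene
write.lock after the writing process has been killed; only safe when you are
sure no other process is still writing to that index:

  // Hedged sketch: remove a leftover write.lock from a Lucene index directory.
  // Safe only if no indexer/dedup job is still running against that index.
  import java.io.File;

  public class ClearStaleLock {
    public static void main(String[] args) {
      File lock = new File(args[0], "write.lock");   // e.g. .../indexes/part-00000
      if (lock.exists()) {
        System.out.println(lock.delete() ? "removed " + lock : "could not remove " + lock);
      } else {
        System.out.println("no lock file at " + lock);
      }
    }
  }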

 -Original Message-
 From: Briggs [mailto:[EMAIL PROTECTED]
 Sent: Thursday, June 07, 2007 6:21 PM
 To: nutch-dev@lucene.apache.org
 Subject: Lock file problems...

 I am getting these lock file errors all over the place when indexing
 or even creating crawldbs.  It doesn't happen all the time, but
 sometimes it happens continuously.  So, I am not quite sure how these
 locks are getting in there, or why they aren't getting removed.

 I am not sure where to go from here.

 My current application is designed for crawling individual domains.
 So, I have multiple custom crawlers that work concurrently.  Each one
 basically does:

 1) fetch
 2) invert links
 3) segment merge
 4) index
 5) deduplicate
 6) merge indexes


 Though, I am still not 100% sure of what the indexes directory is truly
 for.




 java.io.IOException: Lock obtain timed out:
 [EMAIL PROTECTED]:/crawloutput/http$~~www.camlawblog.com/indexes/part-
 0/write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:69)
 at
 org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
 at
 org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
 at
 org.apache.nutch.indexer.DeleteDuplicates.reduce(DeleteDuplicates.java:414
 )
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)


 So, has anyone seen this come up on their own implementations?




[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-06 Thread Gal Nitzan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501914
 ] 

Gal Nitzan commented on NUTCH-485:
--

Could one of the committers review this patch and maybe commit it, please?


The patch touches a few locations, and with so many changes occurring right now 
it might be more complicated to apply it later...

Thanks

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
 NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to the HtmlParseFilter is needed which allows the filter to return 
 a ParseResult, and of course a corresponding change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705130928.patch

Following Andrzej's advice, much cleaner code :)

Attached...

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to the HtmlParseFilter is needed which allows the filter to return 
 a ParseResult, and of course a corresponding change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705131241.patch

Thanks Doğacan, I missed it :( 

Thanks to all reviewers.
 
Yet another patch...

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
 NUTCH-485.200705131241.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to the HtmlParseFilter is needed which allows the filter to return 
 a ParseResult, and of course a corresponding change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705140001.patch

Thanks Doğacan for taking the time to review the code.

I agree with your comments on the usage. I run a video search and it is sure 
going to help. The ability to discover and add content to the segment on the fly 
while parsing is functionality long awaited, and it was all made possible after 
NUTCH-443... :)


And yet one more update with a better description in javadoc and some fixes to 
indentation.

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch, 
 NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, 
 NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to the HtmlParseFilter is needed which allows the filter to return 
 a ParseResult, and of course a corresponding change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Site nightly API link is broken

2007-05-12 Thread Gal Nitzan
Hi,

The link http://lucene.apache.org/nutch/nutch-nightly/docs/api/index.html is 
broken.






RE: Site nightly API link is broken

2007-05-12 Thread Gal Nitzan
Truly sorry but I don't know where it should point to.


 -Original Message-
 From: Sami Siren [mailto:[EMAIL PROTECTED]
 Sent: Saturday, May 12, 2007 11:05 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: Site nightly API link is broken

 Gal Nitzan wrote:
  Hi,
 
  The link http://lucene.apache.org/nutch/nutch-
 nightly/docs/api/index.html is
  broken.
 

 Can you submit a patch (the xml files are under src/site).

 --
  Sami Siren




[jira] Created: (NUTCH-484) Nutch Nightly API link is broken in site

2007-05-12 Thread Gal Nitzan (JIRA)
Nutch Nightly API link is broken in site


 Key: NUTCH-484
 URL: https://issues.apache.org/jira/browse/NUTCH-484
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
Priority: Trivial
 Fix For: 1.0.0


The Nightly API link is broken

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-12 Thread Gal Nitzan (JIRA)
Change HtmlParseFilter 's to return ParseResult object instead of Parse object
--

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0


The current implementation of HtmlParseFilters.java doesn't allow a filter to 
add parse objects to the ParseResult object.

A change to the HtmlParseFilter is needed which allows the filter to return 
a ParseResult, and of course a corresponding change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-12 Thread Gal Nitzan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gal Nitzan updated NUTCH-485:
-

Attachment: NUTCH-485.200705122151.patch

Attached patch for this issue.

Comments are welcome.

This patch touches a few plugins, please review.

Thanks,

Gal

 Change HtmlParseFilter 's to return ParseResult object instead of Parse object
 --

 Key: NUTCH-485
 URL: https://issues.apache.org/jira/browse/NUTCH-485
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Gal Nitzan
 Fix For: 1.0.0

 Attachments: NUTCH-485.200705122151.patch


 The current implementation of HtmlParseFilters.java doesn't allow a filter to 
 add parse objects to the ParseResult object.
 A change to the HtmlParseFilter is needed which allows the filter to return 
 a ParseResult, and of course a corresponding change to HtmlParseFilters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Why not make SOLR the Nutch SE

2007-02-22 Thread Gal Nitzan
Hi,

Since I ran into SOLR the other day I have been wondering: why can't we join 
forces between the two projects?

Both projects complement each other.

Any thoughts?

Gal.






RE: Injector checking for other than STATUS_INJECTED

2007-02-14 Thread Gal Nitzan
Hi Andrzej,

Does it mean that when you inject a URL that already exists in the crawldb, its
status changes to STATUS_DB_UNFETCHED?

Gal

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 8:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Injector checking for other than STATUS_INJECTED

[EMAIL PROTECTED] wrote:
 Hi All,

 I think I am missing something.  In the Injector reduce code we have the
 following.

 
 while (values.hasNext()) {
   CrawlDatum val = (CrawlDatum)values.next();
   if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
 injected = val;
 injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
   } else {
 old = val;
   }
 }

 CrawlDatum res = null;
 if (old != null) res = old; // don't overwrite existing value
 else res = injected;
 

 Basically if it is not just injected then don't overwrite.  But I am not
 seeing where the input could be such that the CrawlDatum wasn't just
 injected and could have previous values.  Is this just in case someone
 uses the Injector as a Reducer and not a Mapper or am I missing how this
 condition can occur.
   

This handles an important case, when you inject URLs that already exist 
in the DB - then you have both the old value and the newly created value 
under the same key. In previous versions of Injector CrawlDatum-s for 
such URLs could be overwritten with new values, and you could lose 
valuable metadata accumulated in old values.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-13 Thread Gal Nitzan

Thanks Dennis, it seems it did the trick. Not sure totally, but so it seems
:)

Gal.

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 13, 2007 11:09 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually I take it back.  I don't think it is the same problem but I do 
think it is the right solution.

Dennis Kubes

Dennis Kubes wrote:
 This has to do with HADOOP-964.  Replace the jar files in your Nutch 
 versions with the most recent versions from Hadoop.  You will also need 
 to apply NUTCH-437 patch to get Nutch to work with the most recent 
 changes to the Hadoop codebase.
 
 Dennis Kubes
 
 Gal Nitzan wrote:
 Hi,

 Does anybody use Nutch trunk?

 I am running nutch 0.9 and unable to fetch.

 after 50-60K urls I get NPE in
 org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time.

 I was wondering if anyone has a workaround, or maybe something is wrong with 
 my setup.

 I have opened a new issue in jira
 http://issues.apache.org/jira/browse/hadoop-1008 for this.

 Any clue?

 Gal






RE: hadoop-site.xml - absolute Path

2007-02-12 Thread Gal Nitzan
Hi Tobias,

The property should go in nutch-site.xml, and you can see a sample for it in
nutch-default.xml.
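
For reference, a property entry in nutch-site.xml follows the usual Hadoop/Nutch
form; the property name and path below are made up purely for illustration:

  <property>
    <name>myplugin.conf.file</name>
    <value>/absolute/path/to/myplugin-conf.xml</value>
    <description>Hypothetical example: absolute path to the plugin's own
    configuration file.</description>
  </property>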

HTH,

Gal

-Original Message-
From: Tobias Zahn [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 13, 2007 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: hadoop-site.xml - absolute Path

Hello out there,
sorry for mailing to this list another time. I'm not sure if I'm not
working carefully enough or something, but I'm facing even more problems.

I put a new property in conf/hadoop-site.xml, according to the examples
in hadoop-default.xml. The new property contains the path to a
configuration file for a plugin.
Then this entry occurs in the log:
2007-02-12 22:38:00,246 FATAL api.RegexURLFilterBase - Can't find
resource: $CORRECT-AND-EXISTING-PATH

Now I wonder if:
1) I can't extend api.RegexURLFilterBase and use another config file or
something similar, or
2) I can't use an absolute path for my properties.

It would be great if anyone is interested in that plugin and would like
to help me finding my errors. Please contact me, I'll mail you the
source (something around 100lines).

[The plugin will make it possible to index only some files, according to
a regex file - similar to urlfilter-regex].

Best regards,
Tobias Zahn




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Gal Nitzan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471747
 ] 

Gal Nitzan commented on NUTCH-443:
--

Actually, I tested Rome after feedparser failed with an OutOfMemoryError. Rome has 
the same problem as feedparser: both convert the feed to JDOM first :(. I had 
to write my own RSS parser implementation with StAX.

Neither Rome nor feedparser could handle a 100K-item feed, which (probably) isn't 
the common use case, but it is not that far-fetched a use case either.
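
A minimal sketch of the StAX approach described above (not the actual parser;
element names assume a plain RSS 2.0 feed): the feed is streamed and each item is
handled as soon as its closing tag is seen, so no DOM/JDOM tree for the whole feed
is ever built.

  import java.io.InputStream;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;

  public class StaxRssReader {
    // Stream the feed and emit one (link, title) pair per completed <item>.
    public static void readItems(InputStream in) throws Exception {
      XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
      String title = null, link = null;
      while (r.hasNext()) {
        int event = r.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          String name = r.getLocalName();
          if ("title".equals(name)) title = r.getElementText();
          else if ("link".equals(name)) link = r.getElementText();
        } else if (event == XMLStreamConstants.END_ELEMENT
                   && "item".equals(r.getLocalName())) {
          // handle one item at a time, e.g. turn it into its own parse entry
          System.out.println(link + " -> " + title);
          title = null;
          link = null;
        }
      }
      r.close();
    }
  }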

HTH

Gal.

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



NPE while fetching

2007-02-07 Thread Gal Nitzan
Hi,

I experience an NPE while fetching. I use Nutch trunk (from a week ago) with
Hadoop 0.11.1.


java.lang.NullPointerException
at
org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:
2392)
at
org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2087)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:498
)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)


Any pointers to the cause?

Thanks,

Gal.




Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Gal Nitzan
Hi,

IMO it should stay the same.

The feed URL stays the key, and in the filter each item's link element becomes the key for that item.

I will be happy to convert the current parse-rss filter to the suggested
implementation.

Gal.

-- Original Message --
Received: Tue, 06 Feb 2007 10:36:03 AM IST
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

 Hi,
 
 Doug Cutting wrote:
  Doğacan Güney wrote:
  I think it would make much more sense to change parse plugins to take
  content and return Parse[] instead of Parse.
 
  You're right.  That does make more sense.
 
 OK, then should I go forward with this and implement something?   This
 should be pretty easy,
 though I am not sure what to give as keys to a Parse[].
 
 I mean, when getParse returned a single Parse, ParseSegment output them
 as <url, Parse>. But, if getParse
 returns an array, what will be the key for each element?
 
 Something like <url#i, Parse[i]> may work, but this may cause problems
 in dedup (for example,
 assume we fetched the same rss feed twice, and indexed them in different
 indexes. The two versions' url#0 may be
 different items but since they have the same key, dedup will delete the
 older).
 
 --
 Doğacan Güney
 
 
  Doug
 
 
 
 
 





Generator.java bug?

2007-02-02 Thread Gal Nitzan
Hi,

 

After many failures of generate ("Generator: 0 records selected for fetching,
exiting ...") I made a post about it a few days back.

 

I narrowed down to the following function:

 

public Path generate(Path dbDir, Path segments, int numLists, long topN,
long curTime, boolean filter, boolean force)

 

in the following if:  if (readers == null || readers.length == 0 ||
!readers[0].next(new FloatWritable()))

 

 

It turns out that the: !readers[0].next(new FloatWritable()) is the
culprit.

 

 

Gal



RE: Generator.java bug?

2007-02-02 Thread Gal Nitzan

PS.

In the following code:

if (readers == null || readers.length == 0 || !readers[0].next(new
FloatWritable())) {
  LOG.warn("Generator: 0 records selected for fetching, exiting ...");
  LockUtil.removeLockFile(fs, lock);
  fs.delete(tempDir);
  return null;
}

There is no need for the "if" here:
if (readers != null)
  for (int i = 0; i < readers.length; i++) readers[i].close();

-Original Message-
From: Gal Nitzan [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 02, 2007 1:56 PM
To: nutch-dev@lucene.apache.org
Subject: Generator.java bug?

Hi,

 

After many failures of generate ("Generator: 0 records selected for fetching,
exiting ...") I made a post about it a few days back.

 

I narrowed down to the following function:

 

public Path generate(Path dbDir, Path segments, int numLists, long topN,
long curTime, boolean filter, boolean force)

 

in the following if:  if (readers == null || readers.length == 0 ||
!readers[0].next(new FloatWritable()))

 

 

It turns out that the: !readers[0].next(new FloatWritable()) is the
culprit.

 

 

Gal





RE: Generator.java bug?

2007-02-02 Thread Gal Nitzan
Hi Andrzej,

Well, on my system the list does contain URLs and the fetcher does fetch it
correctly; however, if I keep that test in the if, it reports that the list
is empty.

I am not sure, but maybe the first value is not a FloatWritable, or maybe it
is something else?

Thanks,

Gal



-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 02, 2007 3:28 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Generator.java bug?

Gal Nitzan wrote:
 Hi,

  

 After many failures of generate ("Generator: 0 records selected for
 fetching, exiting ...") I made a post about it a few days back.

  

 I narrowed down to the following function:

  

 public Path generate(Path dbDir, Path segments, int numLists, long topN,
 long curTime, boolean filter, boolean force)

  

 in the following if:  if (readers == null || readers.length == 0 ||
 !readers[0].next(new FloatWritable()))

  

  

 It turns out that the: !readers[0].next(new FloatWritable()) is the
 culprit.
   

Well, this condition simply checks if the result is not empty. When we 
open Reader[] on a SequenceFile, each reader corresponds to a 
part-x. There must be at least one part, so we use the one at index 
0. If we cannot retrieve at least one entry from it, then it logically 
follows that the file is empty, and we bail out.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: RSS-fecter and index individul-how can i realize this function

2007-02-01 Thread Gal Nitzan

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current 
implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. Indeed the links are extracted and each item link will be fetched in the 
next fetch as a separate page and will end up as one Lucene document.

IMHO the data that is needed, i.e. the data that would be fetched in the next 
fetch process, is already available in the item element. Each item element 
represents one web resource, and there is no reason to go to the server and 
re-fetch that resource.

Another issue that arises from RSS feeds is that once the feed page is fetched 
you cannot re-fetch it until its fetch interval has expired. A feed's TTL is 
usually very short. Since for now in Nutch all pages are created equal :), it is 
one more thing to think about.

HTH,

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse it in the next fetch phase. Well, there are 2 options here for
what you refer to as it:

 1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

 2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)

Cheers,
  Chris




On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

 Hi,
 
 Many sites provide RSS feeds for several reasons, usually to save bandwidth,
 to give the users concentrated data and so forth.
 
 Some of the RSS files supplied by sites are created specially for search
 engines where each RSS item represent a web page in the site.
 
 IMHO the only thing missing in the parse-rss plugin is storing the data in
 the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new
 flag to CrawlDatum, that would flag the URL as parsable not fetchable?
 
 Just my two cents...
 
 Gal.
 
 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 31, 2007 8:44 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: RSS-fecter and index individul-how can i realize this function
 
 Hi there,
 
   With the explanation that you give below, it seems like parse-rss as it
 exists would address what you are trying to do. parse-rss parses an RSS
 channel as a set of items, and indexes overall metadata about the RSS file,
 including parse text, and index data, but it also adds each item (in the
 channel)'s URL as an Outlink, so that Nutch will process those pieces of
 content as well. The only thing that you suggest below that parse-rss
 currently doesn't do, is to allow you to associate the metadata fields
 category:, and author: with the item Outlink...
 
 Cheers,
   Chris
 
 
 
 On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:
 
 thx for ur reply.
 maybe i didn't tell clearly.
 I want to index the item as an individual page. Then when i search something,
 for example "nutch-open source", nutch returns a hit which contains:

   title : nutch-open source
   description : nutch nutch nutch nutch nutch
   url : http://lucene.apache.org/nutch
   category : news
   author : kauu

 so, is the plugin parse-rss able to satisfy what i need?

 <item>
   <title>nutch--open source</title>
   <description>nutch nutch nutch nutch nutch</description>
   <link>http://lucene.apache.org/nutch</link>
   <category>news</category>
   <author>kauu</author>
 </item>

 On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

  Hi there,

  I could most likely

RE: RSS-fecter and index individul-how can i realize this function

2007-01-31 Thread Gal Nitzan
Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth, to 
give the users concentrated data and so forth.

Some of the RSS files supplied by sites are created specially for search 
engines, where each RSS item represents a web page on the site.

IMHO the only thing missing in the parse-rss plugin is storing the data in 
the CrawlDatum and parsing it in the next fetch phase. Maybe add a new 
flag to CrawlDatum that would flag the URL as parsable, not fetchable?

Just my two cents...

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 31, 2007 8:44 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS file,
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do, is to allow you to associate the metadata fields
category:, and author: with the item Outlink...

Cheers,
  Chris



On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:

 thx for ur reply.
 maybe i didn't tell clearly.
 I want to index the item as an individual page. Then when i search something,
 for example "nutch-open source", nutch returns a hit which contains:

   title : nutch-open source
   description : nutch nutch nutch nutch nutch
   url : http://lucene.apache.org/nutch
   category : news
   author : kauu

 so, is the plugin parse-rss able to satisfy what i need?

 <item>
   <title>nutch--open source</title>
   <description>nutch nutch nutch nutch nutch</description>
   <link>http://lucene.apache.org/nutch</link>
   <category>news</category>
   <author>kauu</author>
 </item>

 On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

  Hi there,

  I could most likely be of assistance, if you gave me some more information.
  For instance: I'm wondering if the use case you describe below is already
  supported by the current RSS parse plugin?

  The current RSS parser, parse-rss, does in fact index individual items that
  are pointed to by an RSS document. The items are added as Nutch Outlinks,
  and added to the overall queue of URLs to fetch. Doesn't this satisfy what
  you mention below? Or am I missing something?

  Cheers,
    Chris

  On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

   Hi folks:

   What I want to do is to separate a rss file into several pages.

   Just as what has been discussed before. I want to fetch a rss page and
   index it as different documents in the index. So the searcher can search
   the item's info as an individual hit.

   What's my opinion: create a protocol to fetch the rss page and store it
   as several ones which each just contain one ITEM tag. But the unique key
   is the url, so how can I store them with the ITEM's link tag as the
   unique key for a document?

   So my question is how to realize this function in nutch-0.8.x.

   I've checked the code of the plug-in protocol-http, but I can't find the
   code where a page is stored to a document. I want to separate the rss
   page into several ones before storing it, so it is stored not as one
   document but as several.

   So can any one give me some hints?

   Any reply will be appreciated!

   ITEM's structure

   <item>
     <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
     <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场
     内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。
     据报道,迟来的暴风雪连续两天横扫中...</description>
     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
     <category>搜狐焦点图新闻</category>
     <author>[EMAIL PROTECTED]</author>
     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
   </item>
 
 

 





-- 
www.babatu.com







Generator: 0 records selected for fetching, exiting

2007-01-29 Thread Gal Nitzan
hi,

ENV: FC6, JVM 1.6, Nutch trunk with Hadoop 0.10.1

when running generate I get the following msg:

Generator: 0 records selected for fetching, exiting

When using readdb there are unfetched urls

Statistics for CrawlDb: vcrawldb
TOTAL urls: 1525
retry 0:1525
min score:  0.0060
avg score:  0.166
max score:  1.195
status 1 (db_unfetched):1338
status 2 (db_fetched):  127
status 3 (db_gone): 60
CrawlDb statistics: done

Any idea?




RE: parse-rss make them items as different pages

2007-01-26 Thread Gal Nitzan
Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I 
need the same functionality, but it doesn't exist and I believe it's not a 
simple task.

The functionality required, basically, is to create a page in the segment for each 
item and to add the item's URL to the crawldb.

Since the data already exists in the item element there is no reason to fetch 
the page (item). After that the only thing left is to index it.

Any thoughts on how to achieve that goal?

Gal.






-Original Message-
From: kauu [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss make them items as different pages

I want to crawl RSS feeds and parse them, then index them, and at last, when
searching the content, I just want the hit to look like an individual page.


I don't know whether I'm explaining it clearly.

<item>
<title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
<description>暴风雪横扫欧洲,导致多次航班延误
1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。
据报道,迟来的暴风雪连续两天横扫中...
</description>
<link>http://news.sohu.com/20070125/n247833568.shtml</link>
<category>搜狐焦点图新闻</category>
<author>[EMAIL PROTECTED]</author>
<pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>

<comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

this is one item in an rss file

i want nutch to deal with an item like an individual page,

so when i search something in this item, nutch returns it as a hit.

so ...
can any one tell me how to do this?
any reply will be appreciated

-- 
www.babatu.com


record version mismatch occured

2007-01-26 Thread Gal Nitzan
Trying to mergesegs I get the following, any idea?


A record version mismatch occured. Expecting v4, found v5
at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:147)
at
org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1
175)
at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1258)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordRea
der.java:69)
at
org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerge
r.java:139)
at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:201)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)




RE: record version mismatch occured

2007-01-26 Thread Gal Nitzan
Thanks Sami,

By redo, do you mean re-parse, or re-fetch + re-parse?

-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 26, 2007 10:49 PM
To: nutch-dev@lucene.apache.org
Subject: Re: record version mismatch occured

Gal Nitzan wrote:
 Got it. I used latest trunk for a few hours and it seems that it changed
the
 version of Crawldatum to ver. 5 :(

Earlier one left too early. One (or more) of your segments has data
written with the newer version. If you haven't updated the crawldb then you just
need to redo that (those) segment(s).

--
 Sami Siren





java.io.FileNotFoundException: / (Is a directory)

2007-01-26 Thread Gal Nitzan


Just installed latest from trunk.

I run mergesegs and I get the following error in all tasks log files (I use
default log4j.properties):

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: / (Is a directory)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
at
org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
at
org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAp
pender.java:215)
at
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
at
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132
)
at
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
at
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.jav
a:654)
at
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.jav
a:612)
at
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigur
ator.java:509)
at
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:
415)
at
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:
441)
at
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.
java:468)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at
org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAcces
sorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstruc
torAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.ja
va:529)
at
org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.ja
va:235)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
at org.apache.hadoop.mapred.TaskTracker.<clinit>(TaskTracker.java:59)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1346)
log4j:ERROR Either File or DatePattern options are not set for appender
[DRFA].




RE: updating index without refetching

2006-11-28 Thread Gal Nitzan
Hi,

You do not mention whether the new field's data is stored as metadata. Is the
value created during the parse phase, or is it added only during the index
phase?

If your new field is created during the parse process, then you could delete
only the parse folders (segment/crawl_parse, segment/parse_data,
segment/parse_text) and re-run the parse process: bin/nutch parse <segment>

Or, if your field data is added during the index process, then just re-create
your index.

In any case it doesn't seem to me you would need to re-fetch.

HTH

Gal

-Original Message-
From: DS jha [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 28, 2006 4:11 PM
To: nutch-dev@lucene.apache.org
Subject: updating index without refetching

Hi All,

Is it possible to update the index without refetching everything?  I have
changed logic of one of my plugins (which also sets a custom field in the
index) - and I would like this field to get updated without refetching
everything - is it doable?


Thanks,




RE: Error with Hadoop-0.4.0

2006-07-10 Thread Gal Nitzan
To get the same behavior, just try to inject into a new crawldb that doesn't
exist yet.

The reason many don't see it is that the crawldb already exists in their
environment.



-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 06, 2006 7:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:

 Hi,

 I encountered some problems with Nutch trunk version.
 In fact it seems to be related to changes related to Hadoop-0.4.0 and JDK
 1.5
 (more precisely since HADOOP-129 and File replacement by Path).
 Does somebody have the same error?

I am not seeing this (just run inject on a single machine(linux) 
configuration, local fs without problems ).

--
 Sami Siren




RE: search speed

2006-06-15 Thread Gal Nitzan
Hi,

DFS is too slow for the search.

What we did was extract the index, linkdb and segments to the local FS, i.e. to
the hard disk. Each machine has 2x300GB HDs in RAID.

bin/hadoop dfs -get index /nutch/index
bin/hadoop dfs -get linkdb /nutch/linkdb
bin/hadoop dfs -get segments /nutch/segments

When we run out of disk space for the segments on one web server, we add
another web server, use mergesegs to split the segments and use the
distributed search.

HTH


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 10:09 AM
To: nutch-dev@lucene.apache.org
Subject: search speed

I am using DFS. My index contains 3,706,249 documents. Presently, a search
takes from 2 to 4 seconds (I test with a query of 3 search terms).
Tomcat is started on a box with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think
search is very slow now.
Can we make search faster?
What factors influence search speed?







RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Gal Nitzan
In my company we changed the default, and many others probably did the same.
However, we must not ignore the behavior of the irresponsible users of
Nutch. And for that reason the use of the default must be blocked in code.

Just my 2 cents.


-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 9:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Doug Cutting wrote:

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.htm
l 


well, I think incrediBILL has an argument, that people might really 
start excluding bots from their servers if it's
becoming too much. What might help is that incrediBILL would offer an 
index of the site, which should be smaller
than the site itself. I am not sure if there exists a standard for 
something like this. Basically the bot would ask the
server if an index exists and where it is located and what the date it 
is from and then the bot decides to download the index
or otherwise starts crawling the site.

Michi

-- 
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61





RE: NPE When using a merged segment

2006-05-30 Thread Gal Nitzan
I think it is a bug. It saves the old segment name instead of replacing it
with the new segment name

-Original Message-
From: Dominik Friedrich [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 29, 2006 7:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

I have the same problem with a merged segment. I had a look with luke at 
the index and it seems that the indexer puts the old segment names in 
there instead of the name of the merged segment. I'm not sure if I did 
something wrong or if this is a bug.

Dominik

Gal Nitzan schrieb:
 Hi,

 I have built a new index based on the new segment only.



 -Original Message-
 From: Stefan Neufeind [mailto:[EMAIL PROTECTED] 
 Sent: Monday, May 29, 2006 10:03 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: NPE When using a merged segment

 Gal Nitzan wrote:
   
 Hi,

 After using mergesegs to merge all my segments to one segment only, I
 
 moved
   
 the new segment to segments.

 When accessing the web UI I get:

 java.lang.RuntimeException: java.lang.NullPointerException
  

 

org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:20
   
 3)
  org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
  org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:175)
 

 Hi,

 I'm not sure - but have you tried reindexing that new segment? To my
 understanding the index holds refereences to the segment (segment-name)
 - and in your case those are invalid. This would also explain the error
 you get (in call to getSummary) because the summary is fetched from the
 segment.

 If this works, then maybe you'll need to find a better way of cleaning
 up the index - not reindexing everything but maybe just rewriting the
 segmeent-names all into one or so.

 Feedback welcome.


 Good luck,
  Stefan




   






NPE When using a merged segment

2006-05-28 Thread Gal Nitzan
Hi,

After using mergesegs to merge all my segments to one segment only, I moved
the new segment to segments.

When accessing the web UI I get:

java.lang.RuntimeException: java.lang.NullPointerException

org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:20
3)
org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:175)

Gal.




RE: Where exactly nutch scoring takes place ?

2006-05-26 Thread Gal Nitzan
Hi,

The scoring in Nutch 0.8 is done in a plugin: scoring-opic. It is called from
Indexer.java.

HTH



-Original Message-
From: ahmed ghouzia [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 26, 2006 3:16 PM
To: nutch-user@lucene.apache.org; nutch-dev@incubator.apache.org
Subject: Where exactly nutch scoring takes place ?

I want to use nutch as an environment to test my proposed algorithm for web
mining

1- Where exactly does the nutch scoring take place? In which packages or
files?

2- Can the LinkAnalysisTool be run at the intranet level? Some documents
mentioned that it can take place only at the whole-web crawling level.

3- What technologies and concepts must I be familiar with to get into
Nutch development?
Is it only JSP and servlets, or anything else?


-
Be a chatter box. Enjoy free PC-to-PC calls  with Yahoo! Messenger with
Voice.




[jira] Commented: (NUTCH-284) NullPointerException during index

2006-05-25 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413231 ] 

Gal Nitzan commented on NUTCH-284:
--

I just had something similar.

Try the following:

run ant on each of your tasktracker machines:

% ant

then restart your Nutch and try again.

I think there is a problem with the classpath.

 NullPointerException during index
 -

  Key: NUTCH-284
  URL: http://issues.apache.org/jira/browse/NUTCH-284
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 For quite a few this reduce > sort has been going on. Then it fails. What 
 could be wrong with this?
 060524 212613 reduce > sort
 060524 212614 reduce > sort
 060524 212615 reduce > sort
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212615 found resource common-terms.utf8 at 
 file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
 060524 212619 Optimizing index.
 060524 212619 job_jlbhhm
 java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
 at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
 at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



RE: error

2006-05-22 Thread Gal Nitzan
A new plugin was added to the code base.

You need to add a new entry, summary-basic or summary-lucene, to the
plugin.includes property in tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml.
HTH

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 22, 2006 11:39 AM
To: nutch-dev@lucene.apache.org
Subject: error

I updated any plugins... And now I get errors in tomcat log: 

May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init
SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin
(summary-basic), extension point: org.apache.nutch.searcher.Summarizer does
not exist.

How fix this problem?







[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412435 ] 

Gal Nitzan commented on NUTCH-271:
--

This functionality is already available in Nutch-0.8

 Meta-data per URL/site/section
 --

  Key: NUTCH-271
  URL: http://issues.apache.org/jira/browse/NUTCH-271
  Project: Nutch
 Type: New Feature

 Versions: 0.7.2
 Reporter: Stefan Neufeind


 We have the need to index sites and attach additional meta-data-tags to them. 
 Afaik this is not yet possible, or is there a workaround I don't see? What 
 I think of is using meta-tags per start-url, only indexing content below that 
 URL, and have the ability to limit searches upon those meta-tags. E.g.
 http://www.example1.com/something1/   - meta-tag companybranch1
 http://www.example2.com/something2/   - meta-tag companybranch2
 http://www.example3.com/something3/   - meta-tag companybranch1
 http://www.example4.com/something4/   - meta-tag companybranch3
 search for everything in companybranch1 or across 1 and 3 or similar

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-271) Meta-data per URL/site/section

2006-05-18 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412436 ] 

Gal Nitzan commented on NUTCH-271:
--

Sorry for the short comment.

Actually the meta-tags functionality is already available in the 0.8 version 
along with the CrawlDatum object.

You can build the required functionality just by developing plugins for parsing, 
indexing and querying.

HTH.

 Meta-data per URL/site/section
 --

  Key: NUTCH-271
  URL: http://issues.apache.org/jira/browse/NUTCH-271
  Project: Nutch
 Type: New Feature

 Versions: 0.7.2
 Reporter: Stefan Neufeind


 We have the need to index sites and attach additional meta-data-tags to them. 
 Afaik this is not yet possible, or is there a workaround I don't see? What 
 I think of is using meta-tags per start-url, only indexing content below that 
 URL, and have the ability to limit searches upon those meta-tags. E.g.
 http://www.example1.com/something1/   - meta-tag companybranch1
 http://www.example2.com/something2/   - meta-tag companybranch2
 http://www.example3.com/something3/   - meta-tag companybranch1
 http://www.example4.com/something4/   - meta-tag companybranch3
 search for everything in companybranch1 or across 1 and 3 or similar

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



RE: Proposal for Avoiding Content Generation Sites

2006-03-09 Thread Gal Nitzan
Actually there is a property in conf: generate.max.per.host

So if you add a message in Generator.java at the appropriate place... you
have what you wish...

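For reference, the corresponding nutch-site.xml entry would look something like
this (the value is just an example; see nutch-default.xml for the actual default
and its description):

  <property>
    <name>generate.max.per.host</name>
    <value>5000</value>
    <description>Maximum number of URLs per host in a single fetchlist
    (example value; -1 means no limit).</description>
  </property>
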
Gal


-Original Message-
From: Rod Taylor [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 08, 2006 7:28 PM
To: Nutch Developer List
Subject: Proposal for Avoiding Content Generation Sites

We've indexed several content generation sites that we want to
eliminate. One had hundreds of thousands of sub-domains spread across
several domains (up to 50M pages in total). Quite annoying.

First is to allow for cleaning up.  This consists of a new option to
updatedb which can scrub the database of all URLs which no longer
match URLFilter settings (regex-urlfilter.txt). This allows a change in
the urlfilter to be reflected against Nutch's current dataset, something
I think others have asked for in the past.

Second is to treat a subdomain as being in the same bucket as the domain
for the generator.  This means that *.domain.com or *.domain.co.uk would
create 2 buckets instead of one per hostname. There is a high likelihood
that sub-domains will be on the same servers as the primary domain
and should be rate-limited as such.  generate.max.per.host would work
more as generate.max.per.domain instead.


Third is ongoing detection. I would like to add a feature to Nutch which
could report anomalies during updatedb or generate. For example, if any
given domain.com bucket during generate is found to have more than 5000
URLs to be downloaded, it should be flagged for a manual review. Write a
record to a text file which can be read and picked up by a person to
confirm that we haven't gotten into a garbage content generation site.
If we are in a content generation site, the person would add this domain
to the urlfilter and the next updatedb would clean out all URLs from
that location.


Are there any thoughts or objections to this? One and 2 are pretty
straight forward. Detection is not so easy.

-- 
Rod Taylor [EMAIL PROTECTED]





Re: Unable to complete a full fetch, reason Child Error

2006-02-26 Thread Gal Nitzan
Still got the same...

I'm not sure if it is relevant to this issue, but the call you added to
Fetcher.java: 

 job.setBoolean("mapred.speculative.execution", false);

doesn't work. All task trackers still fetch together though I have only
3 sites in the fetchlist.

The task trackers fetch the same pages...

I have used latest build from hadoop trunk.

Gal.


On Fri, 2006-02-24 at 14:15 -0800, Doug Cutting wrote:
 Mike Smith wrote:
  060219 142408 task_m_grycae  Parent died.  Exiting task_m_grycae
 
 This means the child process, executing the task, was unable to ping its 
 parent process (the task tracker).
 
  060219 142408 task_m_grycae Child Error
  java.io.IOException: Task process exit with nonzero status.
  at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144)
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97)
 
 And this means that the parent was really still alive, and has noticed 
 that the child killed itself.
 
 It would be good to know how the child failed to contact its parent.  We 
 should probably log a stack trace when this happens.  I just made that 
 change in Hadoop and will propagate it to Nutch.
 
 Doug
 




RE: Nutch Improvement - HTML Parser

2006-02-25 Thread Gal Nitzan
You can always implement your own parser.



On Sat, 2006-02-25 at 16:51 -0500, Fuad Efendi wrote:
 Let's do this, to create /to use existing/ low-level processing, I mean to
 use StartTag and EndTag (which could be different in case of malformed
 HTML), and to look at what is inside.
 
 In this case performance wil improve, and functionality, because we are not
 building DOM, and we are not trying to find and fix HTML errors. Of course
 our Tag class will have Attributes, and we will have StartTag, EndTag, etc.
 I call it low-level 'parsing'. Are we using DOM to parse RTF, PDF, XLS, TXT?
 Even inside existing parser we are using Perl5 to check some metadata, right
 before parsing.
 
 
 =
 Yes sure. I think everybody has already done such things at school...
 Building a DOM provide:
 1. a better parsing of malformed html documents (and there is a lot of
 malformed docs on the web)
 2. gives ability to extract meta-information such as creative commons
 license
 3. gives a high degree of extensibility (HtmlParser extension point) to
 extract some specific informations without parsing the document many times
 (for instance extracting technorati like tags, ...) and just providing a
 simple plugin.
 
 




Re: All tasktrackers access same site at the same time (hadoop) please help

2006-02-16 Thread Gal Nitzan
Thanks for the prompt reply.

I have updated Fetcher.java and hadoop.jar from trunk but I still get
the aforementioned behavior.



On Wed, 2006-02-15 at 15:02 -0800, Doug Cutting wrote:
 Gal Nitzan wrote:
  I noticed all tasktrackers are participating in the fetch.
  
  I have only one site in the injected seed file
  
  I have 5 tasktrackers all except one access the same site.
 
 I just fixed a bug related to this.  Please try updating.
 
 The problem was that MapReduce recently started supporting speculative 
 execution, where, if some tasks appear to be executing slowly and there 
 are idle nodes, then these tasks automatically are run in parallel on 
 another node, and the results of the first that finishes are used.  But 
 this is not appropriate for fetching.  So I just added a mechanism to 
 Hadoop to disable it and then disabled it in the Fetcher.
 
 Note also that the slaves file is now located in the conf/ directory, as 
 is a new file named hadoop-env.sh.  This contains all relevant 
 environment variables, so that we no longer have to rely on ssh's 
 environment passing feature.
 
 Doug
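
A minimal sketch of the disabling step described above, assuming the Hadoop 0.x
JobConf/JobClient API; input, output and mapper setup are omitted:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FetchJobSketch {
  /** Run a fetch-style job with speculative execution turned off. */
  public static void runFetchJob(JobConf job) throws java.io.IOException {
    // Speculative execution re-runs "slow" tasks on idle nodes, so the same
    // pages would be fetched twice; it must be off for fetch jobs.
    job.setBoolean("mapred.speculative.execution", false);
    JobClient.runJob(job);
  }
}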
 




Unable to complete a full fetch, reason Child Error

2006-02-16 Thread Gal Nitzan
During fetch all tasktrackers abort the fetch with:

task_m_b45ma2 Child Error
java.io.IOException: Task process exit with nonzero status.
at
org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97)







Global locking

2006-02-16 Thread Gal Nitzan
I have implemented a down and dirty Global Locking:

I am currently testing it but I would like to get other people's ideas on
this:

I used RMI for this purpose:

An RMI server which implements two methods {
boolean lock(String urlString);
void unlock(String urlString);
}

the server holds a map<key,val> where key is an Integer (host hash) and the
val is a very simplistic class:

public class LockObj {
  private int hash;
  private long start;
  private long timeout;
  private int max_locks;
  private int locks = 0;
  private Object sync_obj = new Object();

  public LockObj(int hash, long timeout, int max_locks) {
this.hash = hash;
this.timeout = timeout;
start = new Date().getTime();
this.max_locks = max_locks;
  }

  public synchronized boolean lock() {
boolean ret = false;

if (locks + 1 <= max_locks) {
  synchronized(sync_obj) {
locks++;
  }
  ret = true;
}
return ret;
  }

  public synchronized void unlock() {
if (locks > 0) {
  synchronized(sync_obj) {
locks--;
  }
}
  }

  public int locks() {
return locks;
  }

  // convert the host part of a url to hash
  // if url exception. use the string input for hash
  public static int make_hash(String urlString) {
URL url = null;
try {
  url = new URL(urlString);
} catch (MalformedURLException e) {
}

return (url==null ? urlString : url.getHost()).hashCode();
  }

  // check if this object timeout has reached.
  // later implement a listener event
  public boolean timeout_reached() {
long current = new Date().getTime();

return (current - start) > timeout;
  }

  // free all
  public void unlock_all() {
synchronized(sync_obj) {
  while (locks != 0)
locks--;
}
  }

  public int hash() {
return hash;
  }
}

not the prettiest thing but just finished the first barrier... it
worked!!!


I changed FetcherThread constructor to create an instance of
SyncManager.

And also in the run method I try to get a lock on the host. If not
successful I add the url into an ArrayList<key,datum> for later
processing...

I also changed generator to put each url into a separate array so all
fetchlists are even.

Would appreciate your comments and any way to improve.

The RMI is a little cumbersome but hey... for now it works for 5 task
trackers without a problem (so it seems) :)


Gal
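
A hedged sketch of the RMI side described above; the interface name, registry
URL and client flow are illustrative, not the actual code:

import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Remote interface for the global per-host lock server (illustrative name). */
interface HostLockService extends Remote {
  boolean lock(String urlString) throws RemoteException;
  void unlock(String urlString) throws RemoteException;
}

class FetcherLockClient {
  public static void main(String[] args) throws Exception {
    HostLockService locks =
        (HostLockService) Naming.lookup("rmi://lockhost:1099/HostLockService");
    String url = "http://www.example.com/page.html";
    if (locks.lock(url)) {        // host below max_locks: safe to fetch now
      try {
        // fetch the page here
      } finally {
        locks.unlock(url);
      }
    } else {
      // host saturated: queue the url for later processing
    }
  }
}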




On Wed, 2006-02-15 at 14:55 -0800, Doug Cutting wrote:
 Andrzej Bialecki wrote:
  (FYI: if you wonder how it was working before, the trick was to generate 
  just 1 split for the fetch job, which then led to just one task being 
  created for any input fetchlist.
 
 I don't think that's right.  The generator uses setNumReduceTasks() to 
 the desired number of fetch tasks, to control how many host-disjoint 
 fetchlists are generated.  Then the fetcher does not permit input files 
 to be split, so that fetch tasks remain host-disjoint.  So lots of 
 splits can be generated, by default one per mapred.map.tasks, permitting 
 lots of parallel fetching.
 
 This should still work.  If it does not, I'd be interested to hear more 
 details.
 
 Doug
 




Re: Global locking

2006-02-16 Thread Gal Nitzan
Well, at the moment it solves the problem I mentioned yesterday where all
tasktrackers will access the same site with hadoop. It seems that the
use of job.setBoolean("mapred.speculative.execution", false); didn't
help and I'm not sure why.

However, though it is one more piece of software, it removes the need for
special treatment of the fetcher, i.e. special fetch lists built by the
generator. So now each fetcher/tasktracker is supposed to access hosts
politely even though its list contains various hosts. Sometimes I noticed
that the generator put both hosts (only 2 hosts in the seed) into the
same fetchlist, which made only one tasktracker work instead of two.

I'm sorry if it sounds a little confusing :) or unreasonable... :)

Gal



On Thu, 2006-02-16 at 13:47 -0800, Doug Cutting wrote:
 Gal Nitzan wrote:
  I have implemented a down and dirty Global Locking:
   [ ... ]
  
  I changed FetcherThread constructor to create an instance of
  SyncManager.
  
  And in also in the run method I try to get a lock on the host. If not
  successful I add the url into a ListArraykey,datum for a later
  processing...
  
  I also changed generator to put each url into a separate array so all
  fetchlists are even.
 
 What problem does this fix?
 
 Doug
 




All tasktrackers access same site at the same time (hadoop) please help

2006-02-15 Thread Gal Nitzan
Hi,

Just installed 0-8 with hadoop from trunk.

I noticed all tasktrackers are participating in the fetch.

I have only one site in the injected seed file

I have 5 tasktrackers all except one access the same site.

I am using nu0.8 dev with hadoop.

Please, any idea?

Thanks.






[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml

2006-01-25 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12364010 ] 

Gal Nitzan commented on NUTCH-186:
--

After reading the code I think I figured it out... :)

The issue of the mapred-default.xml is totally misleading.

Actually: the mapred.map.tasks and mapred.reduce.tasks properties do not have any 
effect when placed in mapred-default.xml (unless JobConf needs them, which I 
didn't check) because this file is loaded only when JobConf is constructed.
But the tasktracker is looking for these properties in nutch-site and not in 
mapred-default.

If these properties do not exist in nutch-site.xml with the correct values 
for your system, the values will be picked up from nutch-default.xml.

Further, I am not sure that nutch-site.xml overriding everything should be the 
correct behavior. Most users know that nutch-site.xml overrides nutch-default, 
but I think we should leave them the option to override nutch-site; it 
would be a good start towards breaking the configuration into parts (ndfs and mapred 
are going to be separated from nutch)...

Gal

 mapred-default.xml is over ridden by nutch-site.xml
 ---

  Key: NUTCH-186
  URL: http://issues.apache.org/jira/browse/NUTCH-186
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
  Environment: All
 Reporter: Gal Nitzan
 Priority: Minor
  Attachments: myBeautifulPatch.patch, myBeautifulPatch.patch

 If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and 
 also in mapred-default.xml the definitions from nutch-site.xml are those that 
 will take effect.
 So if a user mistakenly copies those entries into nutch-site.xml from the 
 nutch-default.xml she will not understand what happens.
 I would like to propose removing these setting completely from the 
 nutch-default.xml and put it only in mapred-default.xml where it belongs.
 I will be happy to supply a patch for that if the proposition is accepted.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml

2006-01-24 Thread Gal Nitzan (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12363903 ] 

Gal Nitzan commented on NUTCH-186:
--

OK, JobConf extends NutchConf and in the (JobConf) constructor it adds the 
mapred-default.xml resource.

The call to add a resource in NutchConf actually inserts any resource file before 
nutch-site.xml, so there is no way to override it. Look at the code at the 
bottom.

The only thing required is to change line 85 in NutchConf to be:

resourceNames.add(name); // add resource name

instead of

resourceNames.add(resourceNames.size()-1, name); // add second to last

and add one more line to the JobConf constructor:

addConfResource("mapred-site.xml");


This way nutch-site.xml overrides nutch-default.xml but other added resources 
can override nutch-site.xml, which in my opinion is reasonable.

If acceptable I will create the patch.


- current code in NutchConf.java 
-
  public synchronized void addConfResource(File file) {
addConfResourceInternal(file);
  }
  private synchronized void addConfResourceInternal(Object name) {
resourceNames.add(resourceNames.size()-1, name); // add second to last
properties = null;// trigger reload
  }
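
A small self-contained demo of the ordering change being proposed (later
resources win); it only illustrates the list manipulation, not the actual
NutchConf code:

import java.util.ArrayList;
import java.util.List;

public class ResourceOrderDemo {
  public static void main(String[] args) {
    // Current behaviour: new resources are inserted *before* nutch-site.xml,
    // so nutch-site.xml always wins.
    List<String> current = new ArrayList<String>();
    current.add("nutch-default.xml");
    current.add("nutch-site.xml");
    current.add(current.size() - 1, "mapred-default.xml");
    System.out.println(current);
    // -> [nutch-default.xml, mapred-default.xml, nutch-site.xml]

    // Proposed behaviour: append at the end, so a later resource such as
    // mapred-site.xml can override nutch-site.xml.
    List<String> proposed = new ArrayList<String>();
    proposed.add("nutch-default.xml");
    proposed.add("nutch-site.xml");
    proposed.add("mapred-default.xml");
    proposed.add("mapred-site.xml");
    System.out.println(proposed);
    // -> [nutch-default.xml, nutch-site.xml, mapred-default.xml, mapred-site.xml]
  }
}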


 mapred-default.xml is over ridden by nutch-site.xml
 ---

  Key: NUTCH-186
  URL: http://issues.apache.org/jira/browse/NUTCH-186
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
  Environment: All
 Reporter: Gal Nitzan
 Priority: Minor


 If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and 
 also in mapred-default.xml the definitions from nutch-site.xml are those that 
 will take effect.
 So if a user mistakenly copies those entries into nutch-site.xml from the 
 nutch-default.xml she will not understand what happens.
 I would like to propose removing these setting completely from the 
 nutch-default.xml and put it only in mapred-default.xml where it belongs.
 I will be happy to supply a patch for that if the proposition is accepted.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml

2006-01-24 Thread Gal Nitzan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-186?page=all ]

Gal Nitzan updated NUTCH-186:
-

Attachment: myBeautifulPatch.patch

The patch is attached.

 mapred-default.xml is over ridden by nutch-site.xml
 ---

  Key: NUTCH-186
  URL: http://issues.apache.org/jira/browse/NUTCH-186
  Project: Nutch
 Type: Bug
 Versions: 0.8-dev
  Environment: All
 Reporter: Gal Nitzan
 Priority: Minor
  Attachments: myBeautifulPatch.patch

 If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and 
 also in mapred-default.xml the definitions from nutch-site.xml are those that 
 will take effect.
 So if a user mistakenly copies those entries into nutch-site.xml from the 
 nutch-default.xml she will not understand what happens.
 I would like to propose removing these setting completely from the 
 nutch-default.xml and put it only in mapred-default.xml where it belongs.
 I will be happy to supply a patch for that if the proposition is accepted.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-15 Thread Gal Nitzan (JIRA)
Proposition: Enable Nutch to use a parser plugin not just based on content type
---

 Key: NUTCH-179
 URL: http://issues.apache.org/jira/browse/NUTCH-179
 Project: Nutch
Type: Improvement
  Components: fetcher  
Versions: 0.8-dev
Reporter: Gal Nitzan


Sometimes there are requirements of the real world (usually your boss) where a 
special parse is required for a certain site.

Sample: I am required to crawl certain sites where some of them are partners 
sites. when fetching from the partners site I need to look for certain entries 
in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself I noticed that it would give a very easy 
implementation for others if ParserFactory could use NutchConf to check for 
certain properties and if matched to use the correct plugin based on the url 
and not just the content type.

The implementation shouldn't be too complicated.

Looking to hear more ideas.
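
A hedged sketch of the idea; the property format, class and method names are
hypothetical, not existing Nutch configuration:

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class UrlBasedParserSelector {

  // Hypothetical property format: comma-separated "host=pluginId" pairs, e.g.
  //   parser.host.overrides = partner.example.com=parse-partner
  private final Map<String, String> overrides = new HashMap<String, String>();

  public UrlBasedParserSelector(String overrideProperty) {
    if (overrideProperty == null) return;
    for (String pair : overrideProperty.split(",")) {
      String[] kv = pair.trim().split("=");
      if (kv.length == 2) overrides.put(kv[0], kv[1]);
    }
  }

  /** Return a plugin id for this url, or null to fall back to content-type lookup. */
  public String pluginFor(String urlString) {
    try {
      return overrides.get(new URL(urlString).getHost());
    } catch (Exception e) {
      return null;
    }
  }

  public static void main(String[] args) {
    UrlBasedParserSelector sel = new UrlBasedParserSelector(
        "partner.example.com=parse-partner");
    System.out.println(sel.pluginFor("http://partner.example.com/page.html")); // parse-partner
    System.out.println(sel.pluginFor("http://www.example.com/page.html"));     // null
  }
}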

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-15 Thread Gal Nitzan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]

Gal Nitzan updated NUTCH-179:
-

Description: 
Sometimes there are requirements of the real world (usually your boss) where a 
special parse is required for a certain site. Though the content type is 
text/html, a specialized parser is needed.

Sample: I am required to crawl certain sites where some of them are partners 
sites. when fetching from the partners site I need to look for certain entries 
in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself I noticed that it would give a very easy 
implementation for others if ParserFactory could use NutchConf to check for 
certain properties and if matched to use the correct plugin based on the url 
and not just the content type.

The implementation shouldn't be too complicated.

Looking to hear more ideas.

  was:
Sometimes there are requirements of the real world (usually your boss) where a 
special parse is required for a certain site.

Sample: I am required to crawl certain sites where some of them are partners 
sites. when fetching from the partners site I need to look for certain entries 
in the text and boost the score.

Currently the ParserFactory looks for a plugin based only on the content type.

Facing this issue myself I noticed that it would give a very easy 
implementation for others if ParserFactory could use NutchConf to check for 
certain properties and if matched to use the correct plugin based on the url 
and not just the content type.

The implementation shouldn't be too complicated.

Looking to hear more ideas.


 Proposition: Enable Nutch to use a parser plugin not just based on content 
 type
 ---

  Key: NUTCH-179
  URL: http://issues.apache.org/jira/browse/NUTCH-179
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.8-dev
 Reporter: Gal Nitzan


 Sometimes there are requirements of the real world (usually your boss) where 
 a special parse is required for a certain site. Though the content type is 
 text/html, a specialized parser is needed.
 Sample: I am required to crawl certain sites where some of them are partners 
 sites. when fetching from the partners site I need to look for certain 
 entries in the text and boost the score.
 Currently the ParserFactory looks for a plugin based only on the content type.
 Facing this issue myself I noticed that it would give a very easy 
 implementation for others if ParserFactory could use NutchConf to check for 
 certain properties and if matched to use the correct plugin based on the url 
 and not just the content type.
 The implementation shouldn't be too complicated.
 Looking to hear more ideas.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Gal Nitzan
Thanks, I was checking something with the default from jdk...

On Tue, 2006-01-10 at 11:06 +0100, Jérôme Charron wrote:
  the following code would fail in case the meta tags are in upper case
 
  Node nameNode = attrs.getNamedItem("name");
  Node equivNode = attrs.getNamedItem("http-equiv");
  Node contentNode = attrs.getNamedItem("content");
 
 This code works well, because the Nutch HTML Parser uses the Xerces implementation's
 HTMLDocumentImpl object, which lowercases attributes (unlike element
 names, which are uppercased).
 For consistency and to decouple a little Nutch HTML Parser and Xerces
 implementation, I suggest to change these lines by something like:
 Node nameNode = null;
 Node equivNode = null;
 Node contentNode = null;
 for (int i = 0; i < attrs.getLength(); i++) {
   Node attr = attrs.item(i);
   String attrName = attr.getNodeName().toLowerCase();
   if (attrName.equals("name")) {
     nameNode = attr;
   } else if (attrName.equals("http-equiv")) {
     equivNode = attr;
   } else if (attrName.equals("content")) {
     contentNode = attr;
   }
 }
 
 
 Jérôme
 
 
 --
 http://motrech.free.fr/
 http://www.frutch.org/




fetch of XXX failed with: java.lang.ClassCastException: java.util.ArrayList

2006-01-10 Thread Gal Nitzan

Hi,

I traced it to ParseData line 147.

  UTF8.writeString(out, (String) e.getKey());
  UTF8.writeString(out, (String) e.getValue());


 it seems that the Set-Cookie key comes with an ArrayList value?
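
A hedged sketch of a defensive write for multi-valued headers such as
Set-Cookie, assuming the entry value may be either a String or a List of
Strings; writeString here stands in for the UTF8.writeString call above:

import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class MetadataWriter {

  /** Write one metadata entry, tolerating values that arrive as a List. */
  static void writeEntry(DataOutput out, Map.Entry entry) throws IOException {
    writeString(out, (String) entry.getKey());
    Object value = entry.getValue();
    if (value instanceof List) {
      // Join repeated header values instead of blindly casting to String.
      StringBuffer joined = new StringBuffer();
      for (Iterator it = ((List) value).iterator(); it.hasNext();) {
        if (joined.length() > 0) joined.append("; ");
        joined.append(it.next());
      }
      writeString(out, joined.toString());
    } else {
      writeString(out, String.valueOf(value));
    }
  }

  private static void writeString(DataOutput out, String s) throws IOException {
    out.writeUTF(s);   // placeholder for UTF8.writeString(out, s)
  }
}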







Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Gal Nitzan
Because I needed to add two more fields from the meta tags in the html
page I have revised some of the code in HTMLMetaProcessor and in
DOMContentUtils.

I believe it to be a little more generic than the existing code (look at
DOMContentUtils.GetMetaAttributes) and than the sample here from Jérôme,
since the existing code can handle only http-equiv or name...

Since I am not too familiar with svn, I paste it at the bottom of this email; it
might be useful to someone.

On Tue, 2006-01-10 at 08:48 -0800, Doug Cutting wrote:
 Jérôme Charron wrote:
  For consistency and to decouple a little Nutch HTML Parser and Xerces
  implementation, I suggest to change these lines by something like:
  Node nameNode = null;
  Node equivNode = null;
  Node contentNode = null;
  for (int i = 0; i < attrs.getLength(); i++) {
    Node attr = attrs.item(i);
    String attrName = attr.getNodeName().toLowerCase();
    if (attrName.equals("name")) {
      nameNode = attr;
    } else if (attrName.equals("http-equiv")) {
      equivNode = attr;
    } else if (attrName.equals("content")) {
      contentNode = attr;
    }
  }
 
 +1
 


/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
 * implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.nutch.parse.html;

import java.net.URL;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.nutch.parse.Outlink;

import org.w3c.dom.*;

/**
 * A collection of methods for extracting content from DOM trees.
 * <p/>
 * This class holds a few utility methods for pulling content out of
 * DOM nodes, such as getOutlinks, getText, etc.
 */
public class DOMContentUtils {

  public static class LinkParams {
public String elName;
public String attrName;
public int childLen;

public LinkParams(String elName, String attrName, int childLen) {
  this.elName = elName;
  this.attrName = attrName;
  this.childLen = childLen;
}

public String toString() {
      return "LP[el=" + elName + ",attr=" + attrName + ",len=" +
          childLen + "]";
}
  }

  public static HashMap linkParams = new HashMap();

  static {
    linkParams.put("a", new LinkParams("a", "href", 1));
    linkParams.put("area", new LinkParams("area", "href", 0));
    linkParams.put("form", new LinkParams("form", "action", 1));
    linkParams.put("frame", new LinkParams("frame", "src", 0));
    linkParams.put("iframe", new LinkParams("iframe", "src", 0));
    linkParams.put("script", new LinkParams("script", "src", 0));
    linkParams.put("link", new LinkParams("link", "href", 0));
    linkParams.put("img", new LinkParams("img", "src", 0));
  }

  /**
   * This method takes a {@link StringBuffer} and a DOM {@link Node},
   * and will append all the content text found beneath the DOM node to
   * the <code>StringBuffer</code>.
   * <p/>
   * If <code>abortOnNestedAnchors</code> is true, DOM traversal will
   * be aborted and the <code>StringBuffer</code> will not contain
   * any text encountered after a nested anchor is found.
   * <p/>
   *
   * @return true if nested anchors were found
   */
  public static final boolean getText(StringBuffer sb, Node node,
  boolean abortOnNestedAnchors) {
if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
  return true;
}
return false;
  }


  /**
   * This is a convenience method, equivalent to {@link
   * #getText(StringBuffer,Node,boolean) getText(sb, node, false)}.
   */
  public static final void getText(StringBuffer sb, Node node) {
getText(sb, node, false);
  }

  // returns true if abortOnNestedAnchors is true and we find nested 
  // anchors
  private static final boolean getTextHelper(StringBuffer sb, Node node,
 boolean
abortOnNestedAnchors,
 int anchorDepth) {
    if ("script".equalsIgnoreCase(node.getNodeName())) {
  return false;
}
    if ("style".equalsIgnoreCase(node.getNodeName())) {
  return false;
}
    if (abortOnNestedAnchors &&
        "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1)
return true;
}
if (node.getNodeType() == Node.COMMENT_NODE) {
  return false;
}
if (node.getNodeType() == Node.TEXT_NODE) {
  // cleanup and trim the value
  String text = node.getNodeValue();
  text = text.replaceAll("\\s+", " ");
  text = text.trim();
  if (text.length() > 0) {
    if (sb.length() > 0) 

What/how num of required maps is set?

2006-01-09 Thread Gal Nitzan

I am trying to figure out how the required number of maps is set/calculated by
Nutch.

I have 3 task trackers.

I added one more.

When I run fetch only the initial three are fetching.

I have added the task tracker before calling generate (if it has any
meaning)

Thanks,

G.





Re: What/how num of required maps is set? OOP Wrong list

2006-01-09 Thread Gal Nitzan
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote:
 I am trying to figure out how the required number of maps is set/calculated by
 Nutch.
 
 I have 3 task trackers.
 
 I added one more.
 
 When I run fetch only the initial three are fetching.
 
 I have added the task tracker before calling generate (if it has any
 meaning)
 
 Thanks,
 
 G.
 
 
 
 




HTMLMetaProcessor a bug?

2006-01-09 Thread Gal Nitzan
Hi,

I was going over the code and I noticed the following in

class org.apache.nutch.parse.html.HTMLMetaProcessor

method getMetaTagsHelper

the following code would fail in case the meta tags are in upper case

Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.getNamedItem("http-equiv");
Node contentNode = attrs.getNamedItem("content");


G.




Re: NPE in Indexer.java line 184

2006-01-08 Thread Gal Nitzan
Hi Andrzej,

The "value cannot be null" is my message :)

060109 094543 task_r_9xvvcz  Could not get property: segment name
060109 094543 task_r_9xvvcz  [Ljava.lang.StackTraceElement;@154864a
060109 094543 task_r_9xvvcz java.lang.NullPointerException: value cannot
be null
060109 094543 task_r_9xvvcz at
org.apache.lucene.document.Field.<init>(Field.java:469)
060109 094543 task_r_9xvvcz at
org.apache.lucene.document.Field.<init>(Field.java:412)
060109 094543 task_r_9xvvcz at
org.apache.lucene.document.Field.UnIndexed(Field.java:195)
060109 094543 task_r_9xvvcz at
org.apache.nutch.indexer.Indexer.reduce(Indexer.java:200)
060109 094543 task_r_9xvvcz at
org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
060109 094543 task_r_9xvvcz at org.apache.nutch.mapred.TaskTracker
$Child.main(TaskTracker.java:603)


Gal

On Sun, 2006-01-08 at 10:07 +0100, Andrzej Bialecki wrote:
 Gal Nitzan wrote:
 
 Hi
 
  While the reduce task is running I sometimes get this exception and it
  breaks the whole job.
  
  As a workaround I put this line in a try/catch and just return; however
  I was not sure why the meta cannot find the segment key name.
  
  This workaround is good for now.
 
   
 
 
 Stacktrace?
 




NPE in Indexer.java line 184

2006-01-07 Thread Gal Nitzan
Hi

While the reduce task is running I sometimes get this exception and it
breaks the whole job.

As a workaround I put this line in a try/catch and just return; however
I was not sure why the meta cannot find the segment key name.

This workaround is good for now.

G.
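
A hedged sketch of a guard instead of a bare try/catch, assuming the segment
name is read from the parse metadata before Field.UnIndexed is called; the
variable and key names are illustrative, not the exact Indexer code:

// Sketch of the spot in Indexer.reduce() that builds the Lucene document.
// If the parse metadata carries no segment name, skip this record instead
// of letting Field.UnIndexed throw "value cannot be null" and fail the job.
String segmentName = parseData.getMetadata().getProperty("segment");   // illustrative accessor
if (segmentName == null) {
  LOG.warning("Skipping " + key + ": no segment name in parse metadata");
  return;
}
doc.add(Field.UnIndexed("segment", segmentName));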




Re: mapred crawling exception - Job failed!

2006-01-04 Thread Gal Nitzan
Yes, it was fixed. Just update your code from trunk.


On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote:
 Lukas Vlcek wrote:
 
 Hi,
 
  I am trying to use the latest nutch-trunk version but I am facing an
  unexpected "Job failed!" exception. It seems that all crawling work
  has already been done but some threads are hung, which results in an
  exception after some timeout.
 
   
 
 
 This was fixed (or should be fixed :) in the revision r365576. Please 
 report if it doesn't fix it for you.
 




NegativeArraySizeException in search server

2006-01-03 Thread Gal Nitzan
When trying to use the search server I get the following.

I use the trunk from today...

060104 025549 13 Server handler 0 on 9004 call error:
java.io.IOException: java.lang.NegativeArraySizeException
java.io.IOException: java.lang.NegativeArraySizeException
at
org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:35)
at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:23)
at
org.apache.lucene.search.TopDocCollector.<init>(TopDocCollector.java:47)
at org.apache.nutch.searcher.LuceneQueryOptimizer
$LimitedCollector.<init>(LuceneQueryOptimizer.java:52)
at
org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:153)
at
org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:93)
at
org.apache.nutch.searcher.NutchBean.search(NutchBean.java:155)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:324)
at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:200)




Trunk is broken

2005-12-29 Thread Gal Nitzan
It seems that Trunk is now broken...

In Crawl.java line 111 the parameter for parsing is missing.

For myself I have added the line:

boolean parsing = conf.getBoolean("fetcher.parse", true);

and added the param parsing to 
new Fetcher(conf).fetch(segment, threads, parsing);  // fetch it

Also the Javadoc build has a million errors.

Gal




Bug in DeleteDuplicates.java ?

2005-12-29 Thread Gal Nitzan

this function throws IOException. Why?

 public long getPos() throws IOException {
return (doc*INDEX_LENGTH)/maxDoc;
  }

It would actually throw an ArithmeticException. 

What happens when maxDoc is zero?


Gal
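
A hedged sketch of a guarded version of the method quoted above; INDEX_LENGTH,
doc and maxDoc are the fields from that snippet:

  public long getPos() throws IOException {
    if (maxDoc == 0) {
      return 0;   // empty index: report no progress rather than divide by zero
    }
    return (doc * INDEX_LENGTH) / maxDoc;
  }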






java.io.IOException: Job failed

2005-12-29 Thread Gal Nitzan
Hi,

I am using trunk. While trying to crawl I get the following:



in crawl log:

051229 235114 Dedup: adding indexes in: crawl/indexes
051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/crawl-tool.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml
051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-site.xml
051229 235115 Running job: job_r1bmnj
051229 235116  map 0%
051229 235138  reduce 100%
Exception in thread main java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.jav
a:309)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:123)



in tasktracker log:

050825 100222 task_m_ns3ehv  Error running child
050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
050825 100222 task_m_ns3ehv at
org.apache.nutch.indexer.DeleteDuplicates
$1.getPos(DeleteDuplicates.java:193)
050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapTask
$2.next(MapTask.java:102)
050825 100222 task_m_ns3ehv at
org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
050825 100222 task_m_ns3ehv at
org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.TaskTracker
$Child.main(TaskTracker.java:604)

Regards,

Gal




Re: searching return 0 hit

2005-10-18 Thread Gal Nitzan

Hi Michael,

At least on my side, every time I run index, I must stop the server and then 
tomcat, and then restart first the server, then tomcat.


I have asked about this twice in this list but nobody answered.

I'm not sure it is the same issue, but try it.

Regards,

Gal.


Michael Ji wrote:

Somehow, I found my search engine didn't show the
result, even though I can see the index from LukeAll. (It
worked fine before.)

I replaced the ROOT.WAR file in tomcat with nutch's and
launched tomcat in nutch's segment directory (parallel
to the index subdir).

Should I reinstall Tomcat? Or would that be nutch's
indexing issue? My system is running on Linux. 


thanks,

Michael Ji,
-

051019 215411 11 query: com
051019 215411 11 searching for 20 raw hits
051019 215411 11 total hits: 0
051019 215449 12 query request from 65.34.213.205
051019 215449 12 query: net
051019 215449 12 searching for 20 raw hits









[jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Gal Nitzan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-

Attachment: urlfilter-db.tar.gz
AddedDbURLFilter.patch

Fixed some issues with SwarmCache (removed loading as daemon).
Code cleanup and comments.
Added some logging.

 New plugin urlfilter-db
 ---

  Key: NUTCH-100
  URL: http://issues.apache.org/jira/browse/NUTCH-100
  Project: Nutch
 Type: New Feature
   Components: fetcher
 Versions: 0.8-dev
  Environment: MapRed
 Reporter: Gal Nitzan
 Priority: Trivial
  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz

 Hi,
 I have written (not much) a new plugin, based on the URLFilter interface: 
 urlfilter-db .
 The purpose of this plugin is to filter domains, i.e. I would like to crawl 
 the world but to fetch only certain domains.
 The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and 
 on the back-end a database.
 For each url
filter is called
 end for
 filter
  get the domain name from url
   call cache.get domain
   if not in cache try the database
   if in database cache it and return it
   return null
 end filter
 The plugin reads the cache size, jdbc driver, connection string, table to use 
 and domain field from nutch-site.xml
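
 A hedged sketch of the filter loop described above, using plain JDBC and a
 HashMap in place of SwarmCache; the class name, table name and column are
 illustrative:

 import java.net.URL;
 import java.sql.Connection;
 import java.sql.PreparedStatement;
 import java.sql.ResultSet;
 import java.util.HashMap;
 import java.util.Map;

 public class DbUrlFilterSketch {

   private final Map<String, Boolean> cache = new HashMap<String, Boolean>();
   private final Connection connection;

   public DbUrlFilterSketch(Connection connection) {
     this.connection = connection;
   }

   /** Return the url if its domain is in the allow table, otherwise null (filtered out). */
   public String filter(String urlString) {
     try {
       String domain = new URL(urlString).getHost();
       Boolean allowed = cache.get(domain);
       if (allowed == null) {                       // not cached yet: ask the database once
         allowed = Boolean.valueOf(isInDatabase(domain));
         cache.put(domain, allowed);
       }
       return allowed.booleanValue() ? urlString : null;
     } catch (Exception e) {
       return null;                                 // malformed url or db error: drop it
     }
   }

   private boolean isInDatabase(String domain) throws Exception {
     PreparedStatement stmt =
         connection.prepareStatement("SELECT 1 FROM allowed_domains WHERE domain = ?");
     try {
       stmt.setString(1, domain);
       ResultSet rs = stmt.executeQuery();
       return rs.next();
     } finally {
       stmt.close();
     }
   }
 }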

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Gal Nitzan

Hi Michael,

At the moment I have about 3000 domains in my db. I didn't time the 
performance; however, having even 100k domains shouldn't have an impact 
since each domain is fetched only once from the database into the cache. A little 
performance hit may show above 100k (depends on the number of elements defined 
in the xml file).


After a few teething problems, the plugin works nicely and I do not feel 
any impact.


Regards,

Gal


Michael Ji wrote:

hi,

How is performance affected if the size of the domain list
reaches 10,000?

Micheal Ji,

--- Gal Nitzan (JIRA) [EMAIL PROTECTED] wrote:

  

 [



http://issues.apache.org/jira/browse/NUTCH-100?page=all
  

]

Gal Nitzan updated NUTCH-100:
-

   type: Improvement  (was: New Feature)
Description: 
Hi,


I have written a new plugin, based on the URLFilter
interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml


  was:
Hi,

I have written (not much) a new plugin, based on the
URLFilter interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml


Environment: All Nutch versions  (was: MapRed)

Fixed some issues
clean up
Added a patch for Subversion



New plugin urlfilter-db
---

 Key: NUTCH-100
 URL:
  

http://issues.apache.org/jira/browse/NUTCH-100


 Project: Nutch
Type: Improvement
  Components: fetcher
Versions: 0.8-dev
 Environment: All Nutch versions
Reporter: Gal Nitzan
Priority: Trivial
 Attachments: AddedDbURLFilter.patch,
  

urlfilter-db.tar.gz, urlfilter-db.tar.gz


Hi,
I have written a new plugin, based on the
  

URLFilter interface: urlfilter-db .


The purpose of this plugin is to filter domains,
  

i.e. I would like to crawl the world but to fetch
only certain domains.


The plugin uses a caching system (SwarmCache,
  

easier to deploy than JCS) and on the back-end a
database.


For each url
   filter is called
end for
filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter
The plugin reads the cache size, jdbc driver,
  

connection string, table to use and domain field
from nutch-site.xml

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of
the administrators:
  



http://issues.apache.org/jira/secure/Administrators.jspa
  

-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira













[jira] Created: (NUTCH-100) New plugin urlfilter-db

2005-09-29 Thread Gal Nitzan (JIRA)
New plugin urlfilter-db
---

 Key: NUTCH-100
 URL: http://issues.apache.org/jira/browse/NUTCH-100
 Project: Nutch
Type: New Feature
  Components: fetcher  
Versions: 0.8-dev
 Environment: MapRed
Reporter: Gal Nitzan
Priority: Trivial


Hi,

I have written (not much) a new plugin, based on the URLFilter interface: 
urlfilter-db .

The purpose of this plugin is to filter domains, i.e. I would like to crawl the 
world but to fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on 
the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver, connection string, table to use 
and domain field from nutch-site.xml


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-09-29 Thread Gal Nitzan (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-

Attachment: urlfilter-db.tar.gz

The plugin. Extract it, and read the README in the myplugin folder.

 New plugin urlfilter-db
 ---

  Key: NUTCH-100
  URL: http://issues.apache.org/jira/browse/NUTCH-100
  Project: Nutch
 Type: New Feature
   Components: fetcher
 Versions: 0.8-dev
  Environment: MapRed
 Reporter: Gal Nitzan
 Priority: Trivial
  Attachments: urlfilter-db.tar.gz

 Hi,
 I have written (not much) a new plugin, based on the URLFilter interface: 
 urlfilter-db .
 The purpose of this plugin is to filter domains, i.e. I would like to crawl 
 the world but to fetch only certain domains.
 The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and 
 on the back-end a database.
 For each url
filter is called
 end for
 filter
  get the domain name from url
   call cache.get domain
   if not in cache try the database
   if in database cache it and return it
   return null
 end filter
 The plugin reads the cache size, jdbc driver, connection string, table to use 
 and domain field from nutch-site.xml

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira