Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-12 Thread Doğacan Güney
On Mon, Apr 12, 2010 at 14:08, Andrzej Bialecki a...@getopt.org wrote:
 Hi,

 Take two, after s/crawling/search/ ...

 Following the discussion, below is the text of the proposed Board
 Resolution to vote upon.

 [] +1.  Request the Board make Nutch a TLP
 [] +0.  I don't feel strongly about it, but I'm okay with this.
 [] -1.  No, don't request the Board make Nutch a TLP, and here are my
  reasons...

 This is a majority count vote (i.e. no vetoes). The vote is open for 72
 hours.

 Here's my +1.

And here is my +1.


 ===
 X. Establish the Apache Nutch Project

 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web search
 platform for distribution at no charge to the public.

 NOW, THEREFORE, BE IT RESOLVED, that a Project Management
 Committee (PMC), to be known as the Apache Nutch Project,
 be and hereby is established pursuant to Bylaws of the
 Foundation; and be it further

 RESOLVED, that the Apache Nutch Project be and hereby is
 responsible for the creation and maintenance of software
 related to a large-scale web search platform; and be it further

 RESOLVED, that the office of Vice President, Apache Nutch be
 and hereby is created, the person holding such office to
 serve at the direction of the Board of Directors as the chair
 of the Apache Nutch Project, and to have primary responsibility
 for management of the projects within the scope of
 responsibility of the Apache Nutch Project; and be it further

 RESOLVED, that the persons listed immediately below be and
 hereby are appointed to serve as the initial members of the
 Apache Nutch Project:

        • Andrzej Bialecki a...@...
        • Otis Gospodnetic o...@...
        • Dogacan Guney doga...@...
        • Dennis Kubes ku...@...
        • Chris Mattmann mattm...@...
        • Julien Nioche jnio...@...
        • Sami Siren si...@...

 RESOLVED, that the Apache Nutch Project be and hereby
 is tasked with the migration and rationalization of the Apache
 Lucene Nutch sub-project; and be it further

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
 be appointed to the office of Vice President, Apache Nutch, to
 serve in accordance with and subject to the direction of the
 Board of Directors and the Bylaws of the Foundation until
 death, resignation, retirement, removal or disqualification,
 or until a successor is appointed.
 ===


 --
 Best regards,
 Andrzej Bialecki     
 http://www.sigram.com  Contact: info at sigram dot com







-- 
Doğacan Güney


Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-11 Thread Doğacan Güney
Hi,

On Sat, Apr 10, 2010 at 16:32, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.

 Would it make sense to simplify the scope to ... open-source software
 related to large-scale web crawling for distribution at no charge to
 the public?


Actually, shouldn't that be something like web search platform, or maybe a
crawling and search platform? Nutch is not just a crawler.

Anyway, +1 from me.

 BR,

 Jukka Zitting




-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-07 18:54, Doğacan Güney wrote:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think 
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.

 Hmm .. this puzzles me, do you think we should port changes from 1.1 to
 nutchbase? I thought we should do it the other way around, i.e. merge
 nutchbase bits to trunk.


Hmm, I am a bit out of touch with the latest changes but I know that
the differences
between trunk and nutchbase are unfortunately rather large right now.
If merging nutchbase
back into trunk would be easier then sure, let's do that.


 * support for HBase : via ORM or not (see
 NUTCH-808: https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 Again, the advantage of DataNucleus is that we don't have to handcraft
 all the mid- to low-level mappings, just the mid-level ones (JOQL or
 whatever) - the cost of maintenance is lower, and the number of backends
 that are supported out of the box is larger. Of course, this is just
 IMHO - we won't know for sure until we try to use both your custom ORM
 and DataNucleus...

I am obviously a bit biased here but I have no strong feelings really.
DataNucleus
is an excellent project. What I like about avro-based approach is the
essentially free
MapReduce support we get and the fact that supporting another language
is easy. So,
we can expose partial hbase data through a server and a python-client
can easily read/write to it, thanks
to avro. That being said, I am all for DataNucleus or something else.


 --
 Best regards,
 Andrzej Bialecki     
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted.Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


At some point, it would be nice to change generator so that it is only a handful
of methods and a pig (or something else) script. So, we would provide
most of the functions
you may need during generation (accessing various data) but actual
generation would be a pig
process. This way, anyone can easily change generate any way they want
(even make it more jobs
than 2 if they want more complex schemes).
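
As a rough illustration (hypothetical code, nothing like this exists in trunk
or nutchbase today), selection could then be a user-supplied predicate over
whatever fields the storage layer exposes, instead of a single score cutoff:

// Hypothetical sketch only -- no such hook exists in Nutch yet. The point:
// once row data (fetch time, labels from parsing, status, ...) is exposed,
// generation becomes user code instead of a score threshold.
public class TopicAwareGenerateFilter {

  private static final long WEEK  = 7L  * 24 * 60 * 60 * 1000;
  private static final long MONTH = 30L * 24 * 60 * 60 * 1000;

  /**
   * @param fetchTime last fetch time in ms, 0 if never fetched (frontier URL)
   * @param label     a label attached by an earlier parse, may be null
   * @param now       current time in ms
   */
  public boolean shouldGenerate(long fetchTime, String label, long now) {
    if (fetchTime == 0) {
      return true;                                // frontier URL, always eligible
    }
    // weekly recrawl for pages labelled with the topic of interest, monthly otherwise
    long interval = "topicA".equals(label) ? WEEK : MONTH;
    return fetchTime + interval <= now;
  }
}

With something like that, the per-topic frequencies, keyword-based recrawls and
frontier-only selections you describe are just different predicates.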




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808: https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations. We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
 http://www.sigram.com  Contact: info at sigram dot com





 --
 Doğacan Güney



 --
 -MilleBii-




-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Thu, Apr 8, 2010 at 21:11, MilleBii mille...@gmail.com wrote:
 Not sure what u mean by pig script, but I'd like to be able to make a
 multi-criteria selection of Url for fetching...

I mean a query language like

http://hadoop.apache.org/pig/

if we expose data correctly, then you should be able to generate on any criteria
that you want.

  The scoring method forces into a kind of mono dimensional approach
 which is not really easy to deal with.

 The regex filters are good but it assumes you want select URLs on data
 which is in the URL... Pretty limited in fact

 I basically would like to do 'content' based crawling. Say for
 example: that I'm interested in topic A.
 I'd'like to label URLs that match Topic A (user supplied logic).
 Later on I would want to crawl topic A urls at a certain frequency
 and non labeled urls for exploring in a different way.

  This looks like hard to do right now

 2010/4/8, Doğacan Güney doga...@gmail.com:
 Hi,

 On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted.Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


 At some point, it would be nice to change generator so that it is only a
 handful
 of methods and a pig (or something else) script. So, we would provide
 most of the functions
 you may need during generation (accessing various data) but actual
 generation would be a pig
 process. This way, anyone can easily change generate any way they want
 (even make it more jobs
 than 2 if they want more complex schemes).




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
 be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808: https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations. We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical

Re: Nutch 2.0 roadmap

2010-04-07 Thread Doğacan Güney
Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808: https://issues.apache.org/jira/browse/NUTCH-808)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations. We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: [ANNOUNCE] New Nutch Committer: Julien Nioche

2009-12-25 Thread Doğacan Güney
On Fri, Dec 25, 2009 at 21:48, Julien Nioche
lists.digitalpeb...@gmail.com wrote:

 Hi,

 Thank you for the warm welcome, I feel very honoured to have been made a
 Nutch committer.


Congratulations and welcome :) !


 A few lines about myself: I started using Lucene back in 2001, made a few
 small contributions to it and started LIMO - an open source web application
 used for monitoring Lucene indices. Over the last 3 years I have used quite
 a few Apache projects such as SOLR, UIMA and of course Nutch, which I
 recently used for a large scale crawling project involving a 400 node
 cluster and 15 billion URLs fetched. My activities at DigitalPebble also
 cover Natural Language Processing (which is my initial background) and text
 analysis, and I recently started an open source project named Behemoth which
 makes it possible to scale text analysis applications using Hadoop.

 There are quite a few exciting things planned for Nutch in the short term
 and I really look forward to contributing to it in the new year.

 Happy Christmas and best wishes for 2010!

 Julien

 --
 DigitalPebble Ltd
 http://www.digitalpebble.com

 2009/12/24 Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov

  All,
 
  A little while ago I nominated Julien Nioche to be Nutch committer based
 on
  his contributions to the Nutch project (10+ patches in this release
 alone,
  and all the mailing list help and thoughtful design discussion). I'm
 happy
  to announce that the Lucene PMC has voted to make Julien a Nutch
 committer!
 
  Julien, welcome to the team. The typical first committer task is to
 modify
  the Nutch Forrest credits page and add yourself to the website. If you'd
  like to say something about yourself and your background, feel free to do
  so
  as well.
 
  Welcome!
 
  Cheers,
  Chris
 
  ++
  Chris Mattmann, Ph.D.
  Senior Computer Scientist
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 171-266B, Mailstop: 171-246
  Email: chris.mattm...@jpl.nasa.gov
  WWW:   http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Assistant Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 




-- 
Doğacan Güney


Re: State of nutchbase

2009-12-07 Thread Doğacan Güney
Hey everyone,

So I restarted nutchbase efforts with adding an abstraction to the hbase
api. The idea is to use an intermediate nutch api (which then talks with
hbase) instead of communicating with hbase directly. This allows us a) to
not be completely tied down to hbase, making a move to another db in the
future easier b) perhaps to immediately support multiple databases with easy
data migration between them.

What I have is very very (VERY) early and extremely alpha but I am quite
happy with overall idea so I am sharing it for suggestions and reviews.
Again, instead of using hbase directly, nutch will use a nice java bean with
getters and setters. Nutch will then figure out what to read/write into
hbase.

I decided to use avro because it has a very clean design. Here is a  very
basic WebTableRow class:
{"namespace": "org.apache.nutch.storage",
 "protocol": "Web",

 "types": [
 {"name": "WebTableRow", "type": "record",
  "fields": [
  {"name": "rowKey", "type": "string"},
  {"name": "fetchTime", "type": "long"},
  {"name": "title", "type": "string"},
  {"name": "text", "type": "string"},
  {"name": "status", "type": "int"}
  ]
 }
 ]
}

(ignore protocol. I haven't yet figured out how to compile schemas without
protocols)

I have copied and modified avro's SpecificCompiler to generate a java class.
It is mostly the same class as avro's SpecificCompiler however the variables
are all private and are accessed through getters and setters. Here is a
portion of the file:

public class WebTableRow extends NutchTableRow<Utf8> implements SpecificRecord {
  @RowKey // these are used for reflection
  private Utf8 rowKey;
  @RowField
  private long fetchTime;
  @RowField
  private Utf8 title;
  @RowField
  private Utf8 text;
  @RowField
  private int status;
  public Utf8 getRowKey() { ... }
  public void setRowKey(Utf8 value) { ... }
  public long getFetchTime() { ... }
  public void setFetchTime(long value) { ... }
  ...

Note that NutchTableRow extends SpecificRecordBase so this is a proper avro
record. In the future, once hadoop MR supports avro as a serialization
format NutchTableRow-s can easily be output through maps and reduces which
is a nice bonus.

We need to force the usage of setters instead of direct access to variables.
Because one of the nice things about hbase is that you only update the
columns that you changed. However to know which fields are updated (and
thus, map them to hbase columns), we must keep track of what changed.
Currently, NutchTableRow keeps a BitSet for all fields and all setter
functions update this BitSet so we know exactly what changed.
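
In sketch form (simplified, not the actual code in the branch), the tracking
amounts to something like this:

import java.util.BitSet;

// Simplified sketch of the dirty-field tracking described above. Every
// generated setter flips a bit; the serializer later writes only the
// columns whose bits are set.
abstract class TrackedRow {
  private final BitSet changed = new BitSet();     // one bit per generated field

  protected void markChanged(int fieldIndex) { changed.set(fieldIndex); }

  public boolean isChanged(int fieldIndex) { return changed.get(fieldIndex); }

  public void clearChanges() { changed.clear(); }  // e.g. after a writeRow()
}

class SketchWebTableRow extends TrackedRow {
  private static final int FETCH_TIME = 0;         // index assigned by the compiler
  private long fetchTime;

  public long getFetchTime() { return fetchTime; }

  public void setFetchTime(long value) {
    this.fetchTime = value;
    markChanged(FETCH_TIME);                       // this column must be written back
  }
}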

There is also a new interface called NutchSerializer that defines readRow
and writeRow methods(it also needs scans, delete rows etc.. but that's for
later). Currently HbaseSerializer implements NutchSerializer and reads and
writes WebTableRow-s. HbaseSerializer currently works via reflection. It
should be easy to add code generation to our SpecificCompiler so that we can
also output a WebTableRowHbaseSerializer along with WebTableRow instead of
using reflection.
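
The signatures below are only a rough guess at the shape of readRow/writeRow,
just to make the idea concrete (the code in the branch may differ):

import java.io.IOException;

// K is the row key type (Utf8 for WebTableRow above), R the generated row class.
interface NutchSerializerSketch<K, R> {

  /** Read one row from the backing store, or null if it does not exist. */
  R readRow(K key) throws IOException;

  /** Persist only the fields the row has marked as changed. */
  void writeRow(K key, R row) throws IOException;

  // still missing: scans, deleteRow(K key), ...
}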

What I have currently can read and write primitive types + strings into and
from hbase. You can check it out from github.com/dogacan/nutchbase (branch
master, package o.a.n.storage). Again, I would like to note that the code is
very very alpha and is not in a good shape but it should be a good starting
point if you are interested.

Once hbase support is solid, I intend to add support for other databases
(bdb, cassandra and sql come to mind). If I got everything right, then
moving data from one database to another is an incredibly trivial task. So,
you can start with, say, bdb then switch over to hbase once your data gets
large.

Oh I forgot... HbaseSerializer reads a hbase-mapping.xml file that describes
the mapping between fields and hbase columns:

<table name="webtable" class="org.apache.nutch.storage.WebTableRow">
  <description>
    <family name="p"/> <!-- This can also have params like compression,
                            bloom filters -->
    <family name="f"/>
  </description>
  <fields>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="status" family="f" qualifier="st"/>
  </fields>
</table>

Sorry for the long and rambling email. Feel free to ask if anything is
unclear (and I assume it must be, given my incoherent description :)
-- 
Doğacan Güney


About NUTCH-650 (hbase integration)

2009-08-06 Thread Doğacan Güney
Hey list,
I intended to merge in NUTCH-650 last week but stuff got in the way.
However, I am very close to finishing all the work so give me a few more
days and NUTCH-650 will be in (along with a guide in wiki).

-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-29 Thread Doğacan Güney
Hey guys,

Kirby, thanks for all the insightful information! I couldn't comment
much as most of
the stuff went right over my head :) (don't know much about OSGI).

Andrzej, would OSGI make creating plugins easier? One of the things
that bugs me most about our plugin system is the xml files that need
to be created for every plugin. These files have to be written manually
and nutch doesn't report errors very well here, so this process is
extremely error-prone. Do you have something in mind for making this
part any simpler?

On Sun, Jul 26, 2009 at 19:09, Andrzej Bialecki a...@getopt.org wrote:
[..snipping thread as it has gone too long.]

-- 
Doğacan Güney


Re: Running the Crawl without using bin/nutch in side a scala program

2009-07-27 Thread Doğacan Güney
 -         NONE
 2009-07-27 18:49:19,689 WARN  mapred.LocalJobRunner - job_local_0001
 java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not 
 found.
        at org.apache.nutch.net.URLNormalizers.init(URLNormalizers.java:122)
        at 
 org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
        at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

    how to solve this issue any idea please reply to this...


I think $nutch/build/plugins is not in your classpath, but I am not sure.

 Thanks in advance..

 Sailaja







-- 
Doğacan Güney


Re: Server suggestion

2009-07-24 Thread Doğacan Güney
Hi Dennis,

On Fri, Jul 24, 2009 at 16:46, Dennis Kubes ku...@apache.org wrote:


 fredericoagent wrote:

 If I want to setup nutch with lets say 400 million urls in the database.

 Is it better to have a 4-5 super fast and loaded servers or have 12-15
 smaller , cheaper servers.

 More smaller servers.  Make sure they are energy efficient though and have a
 decent amount of Ram.  If a server goes down, you aren't affected as much.


 By superfast I mean cpu is latest quad core or latest six core processor
 with 6 Gigs Ram and 1. or 1.5 TB HD.

 By cheap I mean something like a Xeon quad core 2.26 cpu with 3 Gig Ram
 and
 500 Sata HD.


 or if anyone can suggest a better spec ideal

 Our first servers were 1Ghz (Yes really) running hadoop 0.04 way back when.
  Our first production clusters were core2, 4G ECC, 1 750G hard drive.  These
 days we've been building i7 8-core, 12G ECC, 4T raid-5 machines with up to 8
 disks, 2U for around 2200.00 each.  If you are looking for a good server
 builder check out swt.com. They are supermicro resellers and build solid
 machines.


It suggests here:

http://en.wikipedia.org/wiki/Core_i7#Drawbacks

that core i7's do not support ECC RAM. Have you run into any issues or is WP
wrong here?


 Suggestions.  Don't skimp on the hard drive, do at least 750G or more. Price
 difference is negligible.  Do at least 2G Ram, 4G is better, 8G is better
 than that.  You can get up to 12G on regular motherboards these days.  After
 that it gets much more expensive.  Also go with more recent processors, such as core2
 or i7.  They are more power efficient per processing unit.  If you want a
 really fast machine, do multiple disks in a raid-5 format.

 Dennis




-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-17 Thread Doğacan Güney
Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki a...@getopt.org wrote:
 Hi all,

 I think we should be creating a sandbox area, where we can collaborate
 on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will
 be importing his HBase work as 'nutchbase'. Tika work is the least
 disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
 like to tackle) means significant refactoring so I'd rather put this on a
 branch too.


Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.

 Dogacan, you mentioned that you would like to work on Katta integration.
 Could you shed some light on how this fits with the abstract indexing 
 searching layer that we now have, and how distributed Solr fits into this
 picture?


I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.

About distributed solr: I would very much like to do this and again, I
think, this should be possible to do within nutch. However, distributed
solr is ultimately uninteresting to me because (AFAIK) it doesn't have
the reliability and high-availability that hadoop/hbase have, i.e. if a
machine dies you lose that part of the index.

Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?

 --
 Best regards,
 Andrzej Bialecki     
 http://www.sigram.com  Contact: info at sigram dot com






-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki a...@getopt.org wrote:
 Doğacan Güney wrote:

 Hey list,

 On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki a...@getopt.org wrote:

 Hi all,

 I think we should be creating a sandbox area, where we can collaborate
 on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
 will
 be importing his HBase work as 'nutchbase'. Tika work is the least
 disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
 like to tackle) means significant refactoring so I'd rather put this on a
 branch too.


 Thanks for starting the discussion, Andrzej.

 Can you detail your OSGI plugin framework design? Maybe I missed the
 discussion but
 updating the plugin system has been something that I wanted to do for
 a long time :)
 so I am very much interested in your design.

 There's no specific design yet except I can't stand the existing plugin
 framework anymore ... ;) I started reading on OSGI and it seems that it
 supports the functionality that we need, and much more - it certainly looks
 like a better alternative than maintaining our plugin system beyond 1.x ...


Couldn't agree more with the can't stand plugin framework :D

Any good links on OSGI stuff?

 Oh, an additional comment about the scoring API: I don't think the claimed
 benefits of OPIC outweigh the widespread complications that it caused in the
 API. Besides, getting the static scoring right is very very tricky, so from
 the engineer's point of view IMHO it's better to do the computation offline,
 where you have more control over the process and can easily re-run the
 computation, rather than rely on an online unstable algorithm that modifies
 scores in place ...


Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in a hbase-backed nutch. Give me a couple more days to polish
the scoring API then we can change it if you are not happy with it.



 Dogacan, you mentioned that you would like to work on Katta integration.
 Could you shed some light on how this fits with the abstract indexing 
 searching layer that we now have, and how distributed Solr fits into this
 picture?


 I haven't yet given much thought to Katta integration. But basically,
 I am thinking of
 indexing newly-crawled documents as lucene shards and uploading them
 to katta for searching. This should be very possible with the new
 indexing system. But so far, I have neither studied katta too much nor
 given much thought to integration. So I may be missing obvious stuff.

 Me too..

 About distributed solr: I very much like to do this and again, I
 think, this should be possible to
 do within nutch. However, distributed solr is ultimately uninteresting
 to me because (AFAIK) it doesn't have the reliability and
 high-availability that hadoop/hbase have, i.e. if a machine dies you
 lose that part of the index.

 Grant Ingersoll is doing some initial work on integrating distributed Solr
 and Zookeeper, once this is in a usable shape then I think perhaps it's more
 or less equivalent to Katta. I have a patch in my queue that adds direct
 Hadoop-Solr indexing, using Hadoop OutputFormat. So there will be many
 options to push index updates to distributed indexes. We just need to offer
 the right API to implement the integration, and the current API is IMHO
 quite close.


 Are there any projects going on that are live indexing systems like
 solr, yet are backed up by hadoop HDFS like katta?

 There is the Bailey.sf.net project that fits this description, but it's
 dormant - either it was too early, or there were just too many design
 questions (or simply the committers moved to other things).


 --
 Best regards,
 Andrzej Bialecki     
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Upgrade to hadoop 0.20?

2009-07-09 Thread Doğacan Güney
On Wed, Jul 8, 2009 at 11:13, Julien Nioche
lists.digitalpeb...@gmail.com wrote:

 Good idea.


OK, it turns out that we can't :D.

MapFileOutputFormat (which we use heavily) is not yet upgraded to hadoop
0.20. We can start using hadoop 0.20 but since we have to use old deprecated
APIs, it doesn't make much sense to me.




 2009/7/8 Doğacan Güney doga...@gmail.com

 Hey list,

 Does anyone have any objections to upgrading to hadoop 0.20? As you may
 know, they have completely
 overhauled the MapReduce API(they still keep old API around but it is
 deprecated). There is a lot of mundane
 work to do to change all our MR code to new API but I can do that.

 So what do you guys think?

 --
 Doğacan Güney




 --
 DigitalPebble Ltd
 http://www.digitalpebble.com




-- 
Doğacan Güney


Upgrade to hadoop 0.20?

2009-07-08 Thread Doğacan Güney
Hey list,

Does anyone have any objections to upgrading to hadoop 0.20? As you may
know, they have completely
overhauled the MapReduce API(they still keep old API around but it is
deprecated). There is a lot of mundane
work to do to change all our MR code to new API but I can do that.

So what do you guys think?

-- 
Doğacan Güney


Re: Build failed in Hudson: Nutch-trunk #857

2009-06-27 Thread Doğacan Güney
, Time elapsed: 1.983 sec
[junit] Running org.apache.nutch.metadata.TestMetadata
[junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.399 sec
[junit] Running org.apache.nutch.metadata.TestSpellCheckedMetadata
[junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 12.149 sec
[junit] Running org.apache.nutch.net.TestURLFilters
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.449 sec
[junit] Running org.apache.nutch.net.TestURLNormalizers
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.933 sec
[junit] Running org.apache.nutch.ontology.TestOntologyFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.498 sec
[junit] Running org.apache.nutch.parse.TestOutlinkExtractor
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.306 sec
[junit] Running org.apache.nutch.parse.TestParseData
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.289 sec
[junit] Running org.apache.nutch.parse.TestParseText
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.375 sec
[junit] Running org.apache.nutch.parse.TestParserFactory
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.258 sec
[junit] Running org.apache.nutch.plugin.TestPluginSystem
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 2.139 sec
[junit] Running org.apache.nutch.protocol.TestContent
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.506 sec
[junit] Running org.apache.nutch.protocol.TestProtocolFactory
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.206 sec
[junit] Running org.apache.nutch.searcher.TestHitDetails
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.284 sec
[junit] Running org.apache.nutch.searcher.TestOpenSearchServlet
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.317 sec
[junit] Running org.apache.nutch.searcher.TestQuery
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.042 sec
[junit] Running org.apache.nutch.searcher.TestSummarizerFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.905 sec
[junit] Running org.apache.nutch.searcher.TestSummary
[junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.379 sec
[junit] Running org.apache.nutch.util.TestEncodingDetector
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.627 sec
[junit] Running org.apache.nutch.util.TestGZIPUtils
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.43 sec
[junit] Running org.apache.nutch.util.TestNodeWalker
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.869 sec
[junit] Test org.apache.nutch.util.TestNodeWalker FAILED
[junit] Running org.apache.nutch.util.TestPrefixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.346 sec
[junit] Running org.apache.nutch.util.TestStringUtil
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.285 sec
[junit] Running org.apache.nutch.util.TestSuffixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.362 sec
[junit] Running org.apache.nutch.util.TestURLUtil
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.743 sec

 BUILD FAILED
 http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build.xml:304: 
 Tests failed!

 Total time: 5 minutes 37 seconds
 Publishing Javadoc
 Recording test results




-- 
Doğacan Güney


Why does TestNodeWalker keep failing?

2009-06-12 Thread Doğacan Güney
Hi all,

Does anyone know why TestNodeWalker keeps failing
for the last couple of days?

I can reproduce the error in my computer; test log looks like
this:

Testsuite: org.apache.nutch.util.TestNodeWalker
Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.101 sec
- Standard Error -
java.io.IOException: Server returned HTTP response code: 503 for URL:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown
Source)
at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown
Source)
at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown
Source)
at
org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown
Source)
at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at
org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
at
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
at
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
-  ---

Testcase: testSkipChildren took 1.095 sec
FAILED
UL Content can NOT be found in the node
junit.framework.AssertionFailedError: UL Content can NOT be found in the
node
at
org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:79)

I have no idea why we get a 503 there?
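
The stack trace itself points at the likely cause: Xerces is resolving the
xhtml1-strict DTD over the network, and w3.org is known to throttle automated
DTD requests with 503 responses, so the test fails whenever the build machine
gets rate-limited. One way to take w3.org out of the picture (just a sketch,
not what the test currently does) would be to resolve external entities locally:

import java.io.StringReader;
import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Sketch only: short-circuit external entity resolution so the parser never
// fetches the XHTML DTD from www.w3.org during the test.
public class OfflineDtdParser {
  public static DOMParser newParser() {
    DOMParser parser = new DOMParser();
    parser.setEntityResolver(new EntityResolver() {
      public InputSource resolveEntity(String publicId, String systemId) {
        // hand back an empty entity instead of hitting the network
        return new InputSource(new StringReader(""));
      }
    });
    return parser;
  }
}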

-- 
Doğacan Güney


Re: IOException in dedup

2009-06-03 Thread Doğacan Güney
On Tue, Jun 2, 2009 at 20:13, Nic M nicde...@gmail.com wrote:


 On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:

 Hello,


 I am new with Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS
 X. When I try to start crawling I get the following exception:


 Dedup: starting

 Dedup: adding indexes in: crawl/indexes

 Exception in thread main java.io.IOException: Job failed!

 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)

 at
 org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)

 at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)



 Does anyone know how to solve this problem?


You may be running into this problem:

https://issues.apache.org/jira/browse/NUTCH-525

I suggest updating to 1.0 or applying the patch there.



 You can get an IOException reported by Hadoop when the root cause is that
 you've run out of memory. Normally the hadoop.log file would have the OOM
 exception.

 If you're running from inside of Eclipse, see
 http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.

 -- Ken

 --

 Ken Krugler
 +1 530-210-6378


 Thank you for the pointers Ken. I changed the VM memory parameters as shown
 at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get
 the exception and in Hadoop log I have the following exception

 2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
 2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding
 indexes in: crawl/indexes
 2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
 java.lang.ArrayIndexOutOfBoundsException: -1
 at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
  at
 org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
 at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
  at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

 I am running Lucene 2.1.0. Any idea why I am getting the
 ArrayIndexOutofBoundsEception?

 Nic






-- 
Doğacan Güney


Re: Infinite loop bug in Nutch 0.9

2009-04-02 Thread Doğacan Güney
On Wed, Apr 1, 2009 at 13:29, George Herlin ghher...@gmail.com wrote:

 Sorry, forgot to say, there is an added precondition to causing the bug:

 The redirection has to be fetched before the page it redirects to... if
 not, there will be a pre-existing crawl datum with a reasonable
 refetch-interval.


Maybe this is something fixed between 0.9 and 1.0, but I think
CrawlDbReducer fixes these datums, around line 147 (case
CrawlDatum.STATUS_LINKED). Have you even got stuck in an infinite loop
because of it?




 2009/4/1 George Herlin ghher...@gmail.com

 Hello, there.

 I believe I may have found a infinite loop in Nutch 0.9.

 It happens when a site has a page that refers to itself through a
 redirection.

 The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been a
 little modified, line numbers may vary a little - says, for that case:

 output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED);

 What that does is, inserts an extra (empty) crawl datum for the new url,
 with a re-fetch interval of 0.0.

 However, (see Generator.Selector.map(), particularly lines 144-145), the
 non-refetch condition used seems to be last-fetch+refetch-interval > now ...
 which is always false if refetch-interval==0.0!

 Now, if there is a new link to the new url in that page, that crawl datum
 is re-used, and the whole thing loops indefinitely.

 I've fixed that for myself by changing the quoted line (twice) by:

 output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null,
 CrawlDatum.STATUS_LINKED);

 and that works (btw the 30F should really be the value of
 db.default.fetch.interval, but I haven't the time right now to work out
 the issues); in reality the default constructor and the appropriate
 updater method should, if I am right in analysing the algorithm, always
 enforce a positive refetch interval.

 Of course, another method could be used to remove this self-reference, but
 that could be complicated, as that may happen through a loop (2 or more
 pages etc..., you know what I mean).

 Has that been fixed already, and by what method?

 Best regards

 George Herlin







-- 
Doğacan Güney


Re: [VOTE] Release Apache Nutch 1.0

2009-03-26 Thread Doğacan Güney
So anyone else? Anyone?

On Wed, Mar 25, 2009 at 17:17, Dennis Kubes ku...@apache.org wrote:

 +1, is this binding? :)

 Doğacan Güney wrote:

 Another non-binding +1 from me.

 Hope this one is a keeper :D

 On Mon, Mar 23, 2009 at 22:28, Sami Siren ssi...@gmail.com wrote:

Hello,

I have packaged the third release candidate for Apache Nutch 1.0
release at 
 http://people.apache.org/~siren/nutch-1.0/rc2/

See the CHANGES.txt[1] file for details on release contents and
latest changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/

The following issues that were discovered during the review of last
rc have been fixed:

https://issues.apache.org/jira/browse/NUTCH-722
https://issues.apache.org/jira/browse/NUTCH-723
https://issues.apache.org/jira/browse/NUTCH-725
https://issues.apache.org/jira/browse/NUTCH-726
https://issues.apache.org/jira/browse/NUTCH-727

Please vote on releasing this package as Apache Nutch 1.0. The vote
is open for the next 72 hours. Only votes from Lucene PMC members
are binding, but everyone is welcome to check the release candidate
and voice their approval or disapproval. The vote  passes if at
least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1]

 http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
--Sami Siren




 --
 Doğacan Güney




-- 
Doğacan Güney


Re: Announce: New PMC member Dennis Kubes

2009-03-25 Thread Doğacan Güney
On Wed, Mar 25, 2009 at 12:24, Andrzej Bialecki a...@getopt.org wrote:

 Hi all,

 The Lucene Project Management Committee is happy to announce that Dennis
 Kubes has been voted in as a new PMC member. He is the third Nutch committer
 to represent this project there, and his experience and excellent work on
 Nutch will be also useful in the broader context of the whole Lucene
 project.

 Congratulations, Dennis!


Congratulations!



 --
 Best regards,
 Andrzej Bialecki 
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Announce: New PMC member Dennis Kubes

2009-03-25 Thread Doğacan Güney
Btw, can Dennis be the 3rd +1 that we need so we can finally release
1.0 :D ?

On Wed, Mar 25, 2009 at 16:47, Mattmann, Chris A 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hear, hear. Deservedly so!

 Great job, Dennis!

 Cheers,
 Chris



 On 3/25/09 3:27 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:

  Hi,
 
  On Wed, Mar 25, 2009 at 11:24 AM, Andrzej Bialecki a...@getopt.org
 wrote:
  The Lucene Project Management Committee is happy to announce that Dennis
  Kubes has been voted in as a new PMC member.
 
  Hip, hip, hurray! Congratulations, Dennis!
 
  BR,
 
  Jukka Zitting
 

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






-- 
Doğacan Güney


Re: [VOTE] Release Apache Nutch 1.0

2009-03-23 Thread Doğacan Güney
Another non-binding +1 from me.

Hope this one is a keeper :D

On Mon, Mar 23, 2009 at 22:28, Sami Siren ssi...@gmail.com wrote:

 Hello,

 I have packaged the third release candidate for Apache Nutch 1.0 release at
 http://people.apache.org/~siren/nutch-1.0/rc2/

 See the CHANGES.txt[1] file for details on release contents and latest
 changes. The release was made from tag:
 http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/

 The following issues that were discovered during the review of last rc have
 been fixed:

 https://issues.apache.org/jira/browse/NUTCH-722
 https://issues.apache.org/jira/browse/NUTCH-723
 https://issues.apache.org/jira/browse/NUTCH-725
 https://issues.apache.org/jira/browse/NUTCH-726
 https://issues.apache.org/jira/browse/NUTCH-727

 Please vote on releasing this package as Apache Nutch 1.0. The vote is open
 for the next 72 hours. Only votes from Lucene PMC members are binding, but
 everyone is welcome to check the release candidate and voice their approval
 or disapproval. The vote  passes if at least three binding +1 votes are
 cast.

 [ ] +1 Release the packages as Apache Nutch 1.0
 [ ] -1 Do not release the packages because...

 Here's my +1


 Thanks!


 [1]
 http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
 --
 Sami Siren




-- 
Doğacan Güney


Re: Problems compiling Nutch in Eclipse

2009-03-21 Thread Doğacan Güney
RTF parser is not built by default because the jars it uses have some
licensing issues. And it is out of sync with current trunk so it
does not even build anymore.

This issue may help:
https://issues.apache.org/jira/browse/NUTCH-644

On Sat, Mar 21, 2009 at 03:02, Rodrigo Reyes C. rre...@corbitecso.com wrote:
 Hi

 I have configured my eclipse project as stated here

 http://wiki.apache.org/nutch/RunNutchInEclipse0.9

 Still, I am getting the following errors:

 The return type is incompatible with Parser.getParse(Content)
 RTFParseFactory.java
 nutch/src/plugin/parse-rtf/src/java/org/apache/nutch/parse/rtf    line 52
 Java Problem
 Type mismatch: cannot convert from ParseResult to Parse
 TestRTFParser.java
 nutch/src/plugin/parse-rtf/src/test/org/apache/nutch/parse/rtf    line 78
 Java Problem

 Any ideas on what could be wrong? I already included both
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and
 http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars.

 Thanks in advance

 --
 Rodrigo Reyes C.





-- 
Doğacan Güney


Re: [DISCUSS] contents of nutch release artifact

2009-03-20 Thread Doğacan Güney
On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote:
 Sami Siren wrote:

 Andrzej Bialecki wrote:

 How about the following: we build just 2 packages:

 * binary: this includes only base hadoop libs in lib/ (enough to start a
 local job, no optional filesystems etc), the *.job and *.war files and
 scripts. Scripts would check for the presence of plugins/ dir, and offer an
 option to create it from *.job. Assumption here is that this shouldbe enough
 to run full cycle in local mode, and that people who want to run a
 distributed cluster will first install a plain Hadoop release, and then just
 put the *.job and bin/nutch on the master.

 * source: no build artifacts, no .svn (equivalent to svn export), simple
 tgz.


 this sounds good to me. additionally some new documentation needs to be
 written too.


 I added a simple patch to NUTCH-728 to make a plain source release from svn.
 What do people think, should we add the plain source package into the next rc?
 I would not like to make changes to the binary package now but propose that we
 do those changes post 1.0.


+1 for including plain source release in next rc.

As for local/distributed separation, it is a good idea but I think we
should hold it for 1.1 (or something else) if it requires architectural
changes (and thus needs review and testing).

 --
  Sami Siren




-- 
Doğacan Güney


Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Doğacan Güney
On Thu, Mar 19, 2009 at 16:48, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote:
 (anyway, what's a measly 90MB nowadays .. ;)

 It's a pretty long download unless you have a fast connection and a
 nearby mirror.


I agree. Can't we also do a source-only release? Kind of like a checkout from
svn (without, of course, the svn bits)? I think this would be much more
interesting to me if I weren't using trunk.

So my suggestion is that we have 3 releases: source only, binary only and full.


 BR,

 Jukka Zitting




-- 
Doğacan Güney


Re: [VOTE] Release Apache Nutch 1.0

2009-03-10 Thread Doğacan Güney

Again, my non-binding +1 :)

On 10.Mar.2009, at 09:34, Sami Siren ssi...@gmail.com wrote:


Hello,

I have packaged the second release candidate for Apache Nutch 1.0  
release at


http://people.apache.org/~siren/nutch-1.0/rc1/

See the CHANGES.txt[1] file for details on release contents and  
latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004


Please vote on releasing this package as Apache Nutch 1.0. The vote  
is open for the next 72 hours. Only votes from Lucene PMC members  
are binding, but everyone is welcome to check the release candidate  
and voice their approval or disapproval. The vote  passes if at  
least three binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=logpathrev=752004

--
Sami Siren



Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Doğacan Güney



On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote:


Doğacan Güney wrote:


On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote:


Hello,

I have packaged the first release candidate for Apache Nutch 1.0  
release at


http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents  
and latest

changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

Please vote on releasing this package as Apache Nutch 1.0. The  
vote is open
for the next 72 hours. Only votes from Lucene PMC members are  
binding, but
everyone is welcome to check the release candidate and voice their  
approval
or disapproval. The vote  passes if at least three binding +1  
votes are

cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!



That's great!

I would like to see NUTCH-684 in but I guess I was too late :)

Anyway, my non-binding +1.



uh, I missed that one, sorry. Do you think it's ready to be  
included? (IMO that's an important feature) It's not a big deal for  
me to rebuild the package with that feature included.




I only tested it on a small crawl. Still, I believe it is important
too, so I would like to include it. Worst case, we release a 1.0.1 soon
after :)



--
 Sami Siren



Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Doğacan Güney
On Mon, Mar 9, 2009 at 17:46, Sami Siren ssi...@gmail.com wrote:
 Doğacan Güney wrote:


 On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com
 mailto:ssi...@gmail.com wrote:

 Doğacan Güney wrote:

 On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com
 mailto:ssi...@gmail.com wrote:


 Hello,

 I have packaged the first release candidate for Apache Nutch 1.0
 release at

 http://people.apache.org/~siren/nutch-1.0/rc0/
 http://people.apache.org/%7Esiren/nutch-1.0/rc0/

 See the included CHANGES.txt file for details on release contents and
 latest
 changes. The release was made from tag:

 http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

 Please vote on releasing this package as Apache Nutch 1.0. The vote is
 open
 for the next 72 hours. Only votes from Lucene PMC members are binding,
 but
 everyone is welcome to check the release candidate and voice their
 approval
 or disapproval. The vote  passes if at least three binding +1 votes are
 cast.

 [ ] +1 Release the packages as Apache Nutch 1.0
 [ ] -1 Do not release the packages because...

 Thanks!



 That's great!

 I would like to see NUTCH-684 in but I guess I was too late :)

 Anyway, my non-binding +1.


 uh, I missed that one, sorry. Do you think it's ready to be included?
 (IMO that's an important feature) It's not a big deal for me to rebuild the
 package with that feature included.


 I only tested it on a small crawl. Still, I believe it is important too so
 I would like to include it. Worst case we release a 1.0.1 soon after:)

 I am fine either way. So if you think it's good enough to go in just commit
 it and I'll build another rc. If not then we can release it later too when
 it's ready.


Committed, thanks for waiting :)

 --
 Sami Siren



 --
  Sami Siren






-- 
Doğacan Güney


Re: [VOTE] Release Apache Nutch 1.0

2009-03-08 Thread Doğacan Güney
On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote:
 Hello,

 I have packaged the first release candidate for Apache Nutch 1.0 release at

 http://people.apache.org/~siren/nutch-1.0/rc0/

 See the included CHANGES.txt file for details on release contents and latest
 changes. The release was made from tag:
 http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

 Please vote on releasing this package as Apache Nutch 1.0. The vote is open
 for the next 72 hours. Only votes from Lucene PMC members are binding, but
 everyone is welcome to check the release candidate and voice their approval
 or disapproval. The vote  passes if at least three binding +1 votes are
 cast.

 [ ] +1 Release the packages as Apache Nutch 1.0
 [ ] -1 Do not release the packages because...

 Thanks!


That's great!

I would like to see NUTCH-684 in but I guess I was too late :)

Anyway, my non-binding +1.

 --
 Sami Siren










-- 
Doğacan Güney


Re: Release 1.0?

2009-02-28 Thread Doğacan Güney
On Sat, Feb 28, 2009 at 10:00, Sami Siren ssi...@gmail.com wrote:
 dealmaker wrote:

 Hi,
  Is there going to be a delay of the 1.0 release?  Today is almost Feb 28.
 You said that 1.0 will come in Feb.  I am customizing Nutch 0.9, and I am
 wondering if I should wait couple more days for the 1.0 release.


 I think that no one else but me made any guesses about the release date?
 (since it is virtually impossible due to the fact that this is not a paid
 project).

 The general consensus seems to be that we should get the next release out
 preferably sooner than later. I personally still think that the first
 release candidate is not that far away - we have no blocker issues left and
 it seems (judged by the lack of activity on working with those remaining
 issues) that the ones still there are not too important.

 I am going to commit NUTCH-669 soon and after that I am fine with starting
 the release process. Other devs might have different opinions.


+1. I will finish solr dedup tomorrow; after that I have no more
issues I want to address before 1.0.

 --
 Sami Siren





 --
 Sami Siren

 Thanks.


 Andrzej Bialecki wrote:


 Marko Bauhardt wrote:


 Hi,
 is there anybody out there? ;)
 exists a plan when version 1.0 will be released?

 thanks
 marko


 On Jan 28, 2009, at 9:45 AM, Marko Bauhardt wrote:



 Hi all,
 is there a timeline for the release 1.0? Currently it exists 33 issues
 (9 Bugs).
 Is there a plan for a feature freeze? Maybe some big issues can be
 moved to version 1.1?


 We do exist. ;) We plan to release in February - I can't tell you yet
 when exactly, we need to review the (few) remaining issues that we want to
 resolve before the release.



 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com











-- 
Doğacan Güney


Re: Nutch ScoringFilter plugin problems

2009-01-26 Thread Doğacan Güney
On Mon, Jan 26, 2009 at 2:17 PM, Pau pau...@gmail.com wrote:
 Hello,
 I still have the same problem. I have the following piece of code

   if (linkdb == null) {
     System.out.println("Null linkdb");
   } else {
     System.out.println("LinkDB not null");
   }
   Inlinks inlinks = linkdb.getInlinks(url);
   System.out.println("a");

 On the output I can see it always prints "LinkDB not null", so linkdb is not
 null. But "a" never gets printed, so I guess that at  Inlinks inlinks =
 linkdb.getInlinks(url);  there is some error. Maybe the getInlinks function
 throws an IOException?
 I do catch the IOException, but the catch block is never executed either.


It is very difficult to guess without seeing the exception. Maybe you can try
catching everything (i.e. Throwable) and printing it?
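
For illustration, a minimal sketch of that catch-everything idea, mirroring
the snippet from your updateDbScore code; it assumes the same url and linkdb
variables as above and is only meant to surface the real error in the task logs:

  try {
    Inlinks inlinks = linkdb.getInlinks(url);
    if (inlinks == null) {
      System.out.println("no inlinks entry for " + url);
    } else {
      System.out.println("got inlinks for " + url + ": " + inlinks);
    }
  } catch (Throwable t) {
    // Also catches Errors and unchecked exceptions that a catch (IOException)
    // block would miss, so whatever kills the call shows up in the logs.
    System.out.println("getInlinks failed for " + url);
    t.printStackTrace(System.out);
  }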

 One question, how should I create the LinkDBReader? I do it the following
 way:
  linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
 Is it right? Thanks.


 On Wed, Jan 21, 2009 at 10:16 AM, Pau pau...@gmail.com wrote:

 Ok, I think you are right, maybe inlinks is null. I will try it now.
 Thank you!
 I have no information about the exception. It seems that simply the
 program skips this part of the code... maybe a ScoringFilterExcetion is
 thrown?

 On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney doga...@gmail.com wrote:

 On Tue, Jan 20, 2009 at 7:18 PM, Pau pau...@gmail.com wrote:
  Hello,
  I want to create a new ScoringFilter plugin. In order to evaluate how
  interesting a web page is, I need information about the link structure
  in
  the LinkDB.
  In the method updateDBScore, I have the following lines (among others):
 
  88    linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
  ...
  99    System.out.println("Inlinks to " + url);
 100    Inlinks inlinks = linkdb.getInlinks(url);
 101    System.out.println("a");
 102    Iterator<Inlink> iIt = inlinks.iterator();
 103    System.out.println("b");
 
  "a" always gets printed, but "b" rarely gets printed, so it seems that in
  line 102 an error happens and an exception is raised. Do you know why
  this is happening? What am I doing wrong? Thanks.
 

 Maybe there are no inlinks to that page so inlinks is null? What is
 the exception
 exactly?

 



 --
 Doğacan Güney






-- 
Doğacan Güney


Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-23 Thread Doğacan Güney
So, is it OK to remove the pmd-ext directory for now? It is not clear whether
we will need it once we have the infrastructure, and we don't have the
infrastructure now anyway :D. So I suggest that we remove it for now
(trimming 2.2MB), and add it back after 1.0 when we actually use it.

Is everyone OK with this?

On Wed, Jan 21, 2009 at 12:01 AM, Piotr Kosiorowski
pkosiorow...@gmail.com wrote:
 I have configured hudson for 10 or more projects and always used pmd
 plugin to display the pmd results only - the actual pmd task to
 generate report was run from ant script. Maybe there is such
 possibility tu run pmd reports directly in hudson (not through project
 build scripts) but I have never come accross it.
 Piotr

 On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic
 ogjunk-nu...@yahoo.com wrote:
 They've had pmd integrated with Hudson for many months now, I believe.  I've 
 seen patches in JIRA that were the result of fixes for problems reported by 
 pmd.  Or maybe they run pmd by hand?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 3:40:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions

 On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
 wrote:
  That I don't know...
 
  I don't see the jars here: 
  http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
 
  But who knows, maybe maven/ivy fetch them on demand.  I don't know.
 

 Hmm, does 0.19 use ivy(0.19 also doesn't have pmd)?

 http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Doğacan Güney
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 20, 2009 1:13:20 PM
  Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
 versions
 
  On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
  wrote:
   Lucene doesn't use anything.
   Hadoop uses pmd integrate in Hudson.
  
 
  Does this mean we do not need pmd jars in nutch ( are they provided by
 hudson)?
 
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: Doğacan Güney
   To: nutch-dev@lucene.apache.org
   Sent: Tuesday, January 20, 2009 10:49:44 AM
   Subject: Re: [jira] Created: (NUTCH-680) Update external jars to 
   latest
  versions
  
   2009/1/20 Piotr Kosiorowski :
pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have
committed them long time ago in an attempt to bring some static
analysis toools to nutch sources. There was a short discussion 
around
it and we all thought t was worth doing but it never gained enough
momentum.   There is a pmd target in build.xml file that uses it -
they are not needed in runtime nor for standard builds.
As nutch is built using hudson now I think it would be worth to
integrate pmd (and checkstyle/findbugs/cobertura might be also
interesting) - hudson has very nice plugins for such tools. I am 
using
it in my daily job and I found it valuable.
  
   Thanks for the explanation. I am definitely +1 on having some sort of
   static analysis tools for nutch.
  
   Does anyone know what hadoop/hbase/lucene use for this? or do
   they use something at all?
  
But as I am not active committer now (I only try to follow mailing
lists) I do not think it is my call.  But if everyone will be
interested I can try to look at integration (but it will move 
forward
slowly - my youngest kid was born just 2 months ago and it takes a 
lot
of attention).
  
   Congratulations!
  
Piotr
   
On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
Update external jars to latest versions
---
   
Key: NUTCH-680
URL: 
https://issues.apache.org/jira/browse/NUTCH-680
Project: Nutch
 Issue Type: Improvement
   Reporter: Doğacan Güney
   Assignee: Doğacan Güney
   Priority: Minor
Fix For: 1.0.0
   
   
This issue will be used to update external libraries nutch uses.
   
These are the libraries that are outdated (upon a quick glance):
   
nekohtml (1.9.9)
lucene-highlighter (2.4.0)
jdom (1.1)
carrot2 - as mentioned in another issue
jets3t - above
icu4j (4.0.1)
jakarta-oro (2.0.8)
   
We should probably update tika to whatever the latest is as well 
before
  1.0.
   
   
Please add ones  I missed in comments.
   
Also what exactly is pmd-ext? There is an extra jakarta-oro and 
jaxen
   there.
   
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
   
   
   
  
  
  
   --
   Doğacan

Re: Nutch ScoringFilter plugin problems

2009-01-21 Thread Doğacan Güney
On Tue, Jan 20, 2009 at 7:18 PM, Pau pau...@gmail.com wrote:
 Hello,
 I want to create a new ScoringFilter plugin. In order to evaluate how
 interesting a web page is, I need information about the link structure in
 the LinkDB.
 In the method updateDBScore, I have the following lines (among others):

 88    linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
 ...
 99    System.out.println("Inlinks to " + url);
100    Inlinks inlinks = linkdb.getInlinks(url);
101    System.out.println("a");
102    Iterator<Inlink> iIt = inlinks.iterator();
103    System.out.println("b");

 "a" always gets printed, but "b" rarely gets printed, so it seems that in
 line 102 an error happens and an exception is raised. Do you know why this
 is happening? What am I doing wrong? Thanks.


Maybe there are no inlinks to that page, so inlinks is null? What is
the exception exactly?





-- 
Doğacan Güney


Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Doğacan Güney
2009/1/20 Piotr Kosiorowski pkosiorow...@gmail.com:
 pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have
 committed them a long time ago in an attempt to bring some static
 analysis tools to nutch sources. There was a short discussion around
 it and we all thought it was worth doing, but it never gained enough
 momentum.   There is a pmd target in the build.xml file that uses it -
 they are not needed at runtime nor for standard builds.
 As nutch is built using hudson now, I think it would be worth
 integrating pmd (and checkstyle/findbugs/cobertura might also be
 interesting) - hudson has very nice plugins for such tools. I am using
 it in my daily job and I have found it valuable.

Thanks for the explanation. I am definitely +1 on having some sort of
static analysis tools for nutch.

Does anyone know what hadoop/hbase/lucene use for this? Or do
they use anything at all?

 But as I am not active committer now (I only try to follow mailing
 lists) I do not think it is my call.  But if everyone will be
 interested I can try to look at integration (but it will move forward
 slowly - my youngest kid was born just 2 months ago and it takes a lot
 of attention).

Congratulations!

 Piotr

 On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote:
 Update external jars to latest versions
 ---

 Key: NUTCH-680
 URL: https://issues.apache.org/jira/browse/NUTCH-680
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0


 This issue will be used to update external libraries nutch uses.

 These are the libraries that are outdated (upon a quick glance):

 nekohtml (1.9.9)
 lucene-highlighter (2.4.0)
 jdom (1.1)
 carrot2 - as mentioned in another issue
 jets3t - above
 icu4j (4.0.1)
 jakarta-oro (2.0.8)

 We should probably update tika to whatever the latest is as well before 1.0.


 Please add ones  I missed in comments.

 Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen 
 there.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.






-- 
Doğacan Güney


Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Doğacan Güney
On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:
 Lucene doesn't use anything.
 Hadoop uses pmd integrated in Hudson.


Does this mean we do not need pmd jars in nutch (are they provided by hudson)?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 10:49:44 AM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions

 2009/1/20 Piotr Kosiorowski :
  pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have
  committed them long time ago in an attempt to bring some static
  analysis toools to nutch sources. There was a short discussion around
  it and we all thought t was worth doing but it never gained enough
  momentum.   There is a pmd target in build.xml file that uses it -
  they are not needed in runtime nor for standard builds.
  As nutch is built using hudson now I think it would be worth to
  integrate pmd (and checkstyle/findbugs/cobertura might be also
  interesting) - hudson has very nice plugins for such tools. I am using
  it in my daily job and I found it valuable.

 Thanks for the explanation. I am definitely +1 on having some sort of
 static analysis tools for nutch.

 Does anyone know what hadoop/hbase/lucene use for this? or do
 they use something at all?

  But as I am not active committer now (I only try to follow mailing
  lists) I do not think it is my call.  But if everyone will be
  interested I can try to look at integration (but it will move forward
  slowly - my youngest kid was born just 2 months ago and it takes a lot
  of attention).

 Congratulations!

  Piotr
 
  On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
  Update external jars to latest versions
  ---
 
  Key: NUTCH-680
  URL: https://issues.apache.org/jira/browse/NUTCH-680
  Project: Nutch
   Issue Type: Improvement
 Reporter: Doğacan Güney
 Assignee: Doğacan Güney
 Priority: Minor
  Fix For: 1.0.0
 
 
  This issue will be used to update external libraries nutch uses.
 
  These are the libraries that are outdated (upon a quick glance):
 
  nekohtml (1.9.9)
  lucene-highlighter (2.4.0)
  jdom (1.1)
  carrot2 - as mentioned in another issue
  jets3t - above
  icu4j (4.0.1)
  jakarta-oro (2.0.8)
 
  We should probably update tika to whatever the latest is as well before 
  1.0.
 
 
  Please add ones  I missed in comments.
 
  Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen
 there.
 
  --
  This message is automatically generated by JIRA.
  -
  You can reply to this email to add a comment to the issue online.
 
 
 



 --
 Doğacan Güney





-- 
Doğacan Güney


Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Doğacan Güney
On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:
 That I don't know...

 I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/

 But who knows, maybe maven/ivy fetch them on demand.  I don't know.


Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?

http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 1:13:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions

 On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
 wrote:
  Lucene doesn't use anything.
  Hadoop uses pmd integrate in Hudson.
 

 Does this mean we do not need pmd jars in nutch ( are they provided by 
 hudson)?

  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Doğacan Güney
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 20, 2009 10:49:44 AM
  Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
 versions
 
  2009/1/20 Piotr Kosiorowski :
   pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have
   committed them long time ago in an attempt to bring some static
   analysis toools to nutch sources. There was a short discussion around
   it and we all thought t was worth doing but it never gained enough
   momentum.   There is a pmd target in build.xml file that uses it -
   they are not needed in runtime nor for standard builds.
   As nutch is built using hudson now I think it would be worth to
   integrate pmd (and checkstyle/findbugs/cobertura might be also
   interesting) - hudson has very nice plugins for such tools. I am using
   it in my daily job and I found it valuable.
 
  Thanks for the explanation. I am definitely +1 on having some sort of
  static analysis tools for nutch.
 
  Does anyone know what hadoop/hbase/lucene use for this? or do
  they use something at all?
 
   But as I am not active committer now (I only try to follow mailing
   lists) I do not think it is my call.  But if everyone will be
   interested I can try to look at integration (but it will move forward
   slowly - my youngest kid was born just 2 months ago and it takes a lot
   of attention).
 
  Congratulations!
 
   Piotr
  
   On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
   Update external jars to latest versions
   ---
  
   Key: NUTCH-680
   URL: https://issues.apache.org/jira/browse/NUTCH-680
   Project: Nutch
Issue Type: Improvement
  Reporter: Doğacan Güney
  Assignee: Doğacan Güney
  Priority: Minor
   Fix For: 1.0.0
  
  
   This issue will be used to update external libraries nutch uses.
  
   These are the libraries that are outdated (upon a quick glance):
  
   nekohtml (1.9.9)
   lucene-highlighter (2.4.0)
   jdom (1.1)
   carrot2 - as mentioned in another issue
   jets3t - above
   icu4j (4.0.1)
   jakarta-oro (2.0.8)
  
   We should probably update tika to whatever the latest is as well before
 1.0.
  
  
   Please add ones  I missed in comments.
  
   Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen
  there.
  
   --
   This message is automatically generated by JIRA.
   -
   You can reply to this email to add a comment to the issue online.
  
  
  
 
 
 
  --
  Doğacan Güney
 
 



 --
 Doğacan Güney





-- 
Doğacan Güney


Re: RSS-fecter and index individul-how can i realize this function

2009-01-05 Thread Doğacan Güney
On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote:
 Hello
 I'm trying to make RSSParser do something similar to FeedParser (which
 doesn't work quite right) - that is, instead of indexing the whole contents

Why doesn't FeedParser work? Let's fix whatever is broken in it :D

 of the feed, I want it to show individual items, with their respective title
 and a proper link to the article. I realize that I could index 1 depth
 more, but I'd like to index just the feed, not the articles that go with it
 (to keep the index small and the crawl fast).

 For each item in each RSS channel (the code does not differ much for
 getParse() of RSSParser.java) I do something like

  Outlink[] outlinks = new Outlink[1];
  try{
   outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
  } catch (Exception e) {
   continue;
  }

  parseResult.put(
   whichLink,
   new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
   new ParseData(
 ParseStatus.STATUS_SUCCESS,
 theRSSItem.getTitle(),
 outlinks,
 new Metadata() //was content.getMetadata()
   )
  );

 The problem is, however, that only one item from the whole RSS gets into the
 index, although in the log I can see them all (I've tried it with feeds
 from cnn and reuters). What happens? Why do they get overwritten in a
 seemingly random order? The item that makes it into the index is neither the
 first nor the last, but appears to be the same until new items appear in the
 feed.

 Thank you,
 Vlad





-- 
Doğacan Güney


Re: readlinkdb fails to dump linkdb

2008-12-04 Thread Doğacan Güney
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm [EMAIL PROTECTED] wrote:
 On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
 On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
 Using nutch 0.9 (hadoop 0.17.1):

 [EMAIL PROTECTED] working]$ bin/nutch readlinkdb
 /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
 LinkDb dump: starting
 LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
  

 It seems you are providing a crawldb as argument. You should pass the linkdb.


 Thanks a lot for the hint, but I cannot find the linkdb dir anywhere on
 the HDFS :_/ Can you point me to where it should be?

A linkdb is created with the invertlinks command, e.g.:

bin/nutch invertlinks crawl/linkdb crawl/segments/



 java.io.IOException: Type mismatch in value from map: expected
 org.apache.nutch.crawl.Inlinks, recieved
 org.apache.nutch.crawl.CrawlDatum
at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
at 
 org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 LinkDbReader: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at 
 org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

 This is the first time I use readlinkdb and the rest of the crawling
 process is working ok, I've looked up JIRA and there's no related bug.

 I've also tried latest trunk nutch but DFS is not working for me:

 [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls

 Exception in thread main java.lang.RuntimeException:
 java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.DistributedFileSystem
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.DistributedFileSystem
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
... 10 more

 Should I file both bugs on JIRA ?


 This I am not sure, but did you try ant clean; ant? It may be a
 version mismatch.


 Yes, I did ant clean  ant before trying the above command. I also
 tried to upgrade the filesystem unsuccessfully and even created it
 from scratch:

 https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12650556#action_12650556



 --
 Doğacan Güney





-- 
Doğacan Güney


Re: readlinkdb fails to dump linkdb

2008-12-03 Thread Doğacan Güney
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
 Using nutch 0.9 (hadoop 0.17.1):

 [EMAIL PROTECTED] working]$ bin/nutch readlinkdb
 /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
 LinkDb dump: starting
 LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
  

It seems you are providing a crawldb as argument. You should pass the linkdb.

 java.io.IOException: Type mismatch in value from map: expected
 org.apache.nutch.crawl.Inlinks, recieved
 org.apache.nutch.crawl.CrawlDatum
at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
at 
 org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 LinkDbReader: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at 
 org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

 This is the first time I use readlinkdb and the rest of the crawling
 process is working ok, I've looked up JIRA and there's no related bug.

 I've also tried latest trunk nutch but DFS is not working for me:

 [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls

 Exception in thread main java.lang.RuntimeException:
 java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.DistributedFileSystem
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.DistributedFileSystem
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
... 10 more

 Should I file both bugs on JIRA ?


This I am not sure about, but did you try ant clean; ant? It may be a
version mismatch.


-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
Hi Dennis,

On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
 If nobody has a problem with them I would like to commit the following
 issues in the next day or two:

 NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
 NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
 NUTCH-647: Resolve URLs tool
 NUTCH-665: Search Load Testing Tool
 NUTCH-667: Input Format for working with Content in Hadoop Streaming

 And I would like to commit these in  a week:

 NUTCH-635: LinkAnalysis Tool for Nutch
 NUTCH-646: New Indexing framework for Nutch
 NUTCH-594: Serve Nutch search results in XML and JSON
 NUTCH-666: Analysis plugins and new language identifier.

 There are others too but these are the ones I am trying to get moved into
 trunk right now.


I am OK with all but NUTCH-666... Why a new language identifier? (Or,
if a new one, why keep the old one around?)

 Dennis




-- 
Doğacan Güney


Re: NUTCH-92

2008-11-27 Thread Doğacan Güney
Hi,

On Wed, Nov 26, 2008 at 3:04 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Hi all,

 After reading this paper:

 http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf

 I came up with the following idea of implementing global IDF in Nutch. The
 upside of the approach I propose is that it brings back the cost of making a
 search query to 1 RPC call. The downside is that the search servers need to
 cache global IDF estimates as computed by the DS.Client, which ties them to
 a single query front-end (DistributedSearch.Client), or requires keeping a
 map of <client, globalIDFs> on each search server.

 -

 First, as the paper above claims, we don't really need exact IDF values of
 all terms from every index. We should get acceptable quality if we only
 learn the top-N frequent terms, and for the rest of them we apply a
 smoothing function that is based on global characteristics of each index
 (such as the number of terms in the index).

 This means that the data that needs to be collected by the query integrator
 (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist
 of a list of e.g. top 500 local terms with their frequency, plus the local
 smoothing factor as a single value.

 We could further reduce the amount of data to be sent from/to shard servers
 by encoding this information in a counted Bloom filter with a single-byte
 resolution (or a spectral Bloom filter, whichever yields a better precision
 / bit in our case).

 The query integrator would ask all active shard servers to provide their
 local IDF data, and it would compute global IDFs for these terms, plus a
 global smoothing factor, and send back the updated information to each shard
 server. This would happen once per lifetime of a local shard, and is needed
 because of the local query rewriting (and expansion of terms from Nutch
 Query to Lucene Query).

 Shard servers would then process incoming queries using the IDF estimates
 for terms included in the global IDF data, or the global smoothing factors
 for terms missing from that data (or use local IDFs).

 The global IDF data would have to be recomputed each time the set of shards
 available to a DS.Client changes, and then it needs to be broadcast back
 from the client to all servers - which is the downside of this solution,
 because servers need to keep a cache of this information for every DS.Client
 (each of them possibly having a different list of shard servers, hence
 different IDFs). Also, as shard servers come and go, the IDF data keeps
 being recomputed and broadcast, which increases the traffic between the
 client and servers.

 Still I believe the amount of additional traffic should be minimal in a
 typical scenario, where changes to the shards are much less frequent than
 the frequency of sending user queries. :)

 --

 Now, if this approach seems viable (please comment on this), what should we
 do with the patches in NUTCH-92 ?

 1. skip them for now, and wait until the above approach is implemented, and
 pay the penalty of using skewed local IDFs.

 2. apply them now, and pay the penalty of additional RPC call / search, and
 replace this mechanism with the one described above, whenever that becomes
 available.


It seems I wrote the patch in NUTCH-92. My recollection was that you
wrote it, Andrzej :D
Anyway, I have no idea what I did in that patch, and I don't know if it
still works or applies etc. Really, I am just curious. Did anyone test it?
Does it really work :) ?

I haven't read the paper yet, but the proposed approach sounds better
to me. Do you have any code ready, Andrzej? Or how difficult would it be
to implement?
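
To make the proposal above a bit more concrete, here is a rough, purely
illustrative sketch of the merge step the query integrator could run. None of
these class or method names exist in Nutch, the IDF formula is only a
placeholder, and it uses Java 8 map helpers for brevity:

import java.util.HashMap;
import java.util.Map;

public class GlobalIdfSketch {

  /** What each shard server reports: df for its top-N terms plus its doc count. */
  public static class ShardStats {
    final Map<String, Long> topTermDf;
    final long numDocs;
    public ShardStats(Map<String, Long> topTermDf, long numDocs) {
      this.topTermDf = topTermDf;
      this.numDocs = numDocs;
    }
  }

  private final Map<String, Double> globalIdf = new HashMap<>();
  private double smoothingIdf; // fallback for terms outside every top-N list

  /** Recomputed whenever the set of shards changes, then broadcast back. */
  public void recompute(Iterable<ShardStats> shards) {
    Map<String, Long> mergedDf = new HashMap<>();
    long totalDocs = 0;
    for (ShardStats s : shards) {
      totalDocs += s.numDocs;
      for (Map.Entry<String, Long> e : s.topTermDf.entrySet()) {
        mergedDf.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }
    globalIdf.clear();
    for (Map.Entry<String, Long> e : mergedDf.entrySet()) {
      globalIdf.put(e.getKey(), idf(totalDocs, e.getValue()));
    }
    // Crude smoothing: treat unknown terms as rare (df = 1). A real
    // implementation would derive this factor from per-shard statistics,
    // as the paper suggests.
    smoothingIdf = idf(totalDocs, 1);
  }

  /** Shard servers use this estimate instead of their skewed local IDF. */
  public double idfFor(String term) {
    return globalIdf.getOrDefault(term, smoothingIdf);
  }

  private static double idf(long numDocs, long df) {
    return Math.log((double) numDocs / (double) (1 + df)) + 1.0;
  }
}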

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
And here is a list of issues from me that needs more discussion/review:

NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
review for people, for now we can just write a SolrIndexer like Sami
Siren's and deal with 442 after 1.0. I would be happy to provide such
a patch.

NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
don't know how to fix this one but indexing almost always fails with
index-more enabled.

NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
fetch interval correctly: I botched it once so now I am afraid to
commit it :D

NUTCH-626 - fetcher2 breaks out the domain with
db.ignore.external.links set at cross domain redirects: I am going to
update the patch and commit it if no objections.

Also, I think NUTCH-658 would be a nice feature for 1.0.

There are some others but these are the most recent and we really
should push 1.0 out the door already :D

Oh, and finally, we should do a review of all libraries in nutch
(libraries in plugins included) and update them to the latest versions. I
am going to open an issue with the intention of updating all the
libraries that do not require code changes.

-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
I forgot: I think there is a huge bug with MapWritable in nutch. I
haven't yet figured out what it is exactly, but it has something to do
with the fact that the id-class maps are static.

On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
 And here is a list of issues from me that needs more discussion/review:

 NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
 review for people, for now we can just write a SolrIndexer like Sami
 Siren's and deal with 442 after 1.0. I would be happy to provide such
 a patch.

 NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
 don't know how to fix this one but indexing almost always fails with
 index-more enabled.

 NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
 fetch interval correctly: I botched it once so now I am afraid to
 commit it :D

 NUTCH-626 - fetcher2 breaks out the domain with
 db.ignore.external.links set at cross domain redirects: I am going to
 update the patch and commit it if no objections.

 Also, I think NUTCH-658 would be a nice feature for 1.0.

 There are some others but these are the most recent and we really
 should push 1.0 out the door already :D

 Oh and finally we should do a review of all libraries in nutch
 (libraries in plugins included) and update them to latest versions. I
 am going to open an issue with the intenton of updating all the
 libraries that do not require code changes.

 --
 Doğacan Güney




-- 
Doğacan Güney


Re: Pending Commits for Nutch Issues

2008-11-27 Thread Doğacan Güney
OK one last thing: Get rid of Fetcher and promote Fetcher2 to be the
default fetcher.

On Thu, Nov 27, 2008 at 7:15 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
 I forgot: I think there is a huge bug with MapWritable in nutch. I
 didn't yet figure out what it is
 exactly but it has something to do with the fact that id-class maps are 
 static.

 On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
 And here is a list of issues from me that needs more discussion/review:

 NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
 review for people, for now we can just write a SolrIndexer like Sami
 Siren's and deal with 442 after 1.0. I would be happy to provide such
 a patch.

 NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
 don't know how to fix this one but indexing almost always fails with
 index-more enabled.

 NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
 fetch interval correctly: I botched it once so now I am afraid to
 commit it :D

 NUTCH-626 - fetcher2 breaks out the domain with
 db.ignore.external.links set at cross domain redirects: I am going to
 update the patch and commit it if no objections.

 Also, I think NUTCH-658 would be a nice feature for 1.0.

 There are some others but these are the most recent and we really
 should push 1.0 out the door already :D

 Oh and finally we should do a review of all libraries in nutch
 (libraries in plugins included) and update them to latest versions. I
 am going to open an issue with the intenton of updating all the
 libraries that do not require code changes.

 --
 Doğacan Güney




 --
 Doğacan Güney




-- 
Doğacan Güney


Re: NUTCH-92

2008-11-27 Thread Doğacan Güney
On Thu, Nov 27, 2008 at 11:40 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:


 It seems I wrote the patch in NUTCH-92. My recollection was that you
 wrote it, Andrzej :D

 No, I didn't - you did! :) I only came up with the proposal, after
 discussing it with Doug.

 Anyway, I have no idea what I did in that patch, don't know if it
 works or applies etc. Really,
 I am just curios. Did anyone test it? Does it really work :) ?

 Not me. I shied away from the patch because I didn't like the 2 RPC-s per
 search. I still don't like it, but I may have to accept it as an interim
 solution.

 That was my question, really - for release 1.0:

 * are we better off not having this patch, and just be careful how we split
 indexes among searchers as we do it now, or

 * should we apply the patch, pay the price of 2 RPCs, and wait for the patch
 implementing the approach that I proposed?

 * or make an effort to implement the new approach, and postpone the release
 until this is ready.


The 3rd approach sounds the best, especially if the new approach is not
difficult to implement.
(I may even give it a try if I have the time.)



 I haven't read the paper yet but the proposed approach sounds better
 to me. Do you have any
 code ready, Andrzej? Or how difficult is it to implement it?

 No code yet, just thinking aloud. But it's not really anything complicated,
 chunks of code already exist that implement almost all building blocks of
 the algorithm.

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: 1.0 Release?

2008-11-23 Thread Doğacan Güney
I agree with this list and have nothing new to add.

(Except, I guess people also want NUTCH-92 to be fixed)

On Thu, Nov 20, 2008 at 6:51 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Dennis Kubes wrote:

 What does everybody think of trying to do a Nutch 1.0 release in the next
 couple of weeks.  I have 8 different patches that are ready to be committed
 including:

 1) NUTCH-647: Resolve URLs tool
 2) NUTCH-635: LinkAnalysis Tool for Nutch
 3) NUTCH-646: New Indexing framework for Nutch
 4) NUTCH-594: Serve Nutch search results in XML and JSON
 5) Custom fields on index and plugins
 6) Upgrade Nutch to the most recent Hadoop version (18.2).
 7) Upgrade Nutch to the most recent Lucene version (2.4).
 8) Analysis plugins and improvments to analyzer factory for multiple
 languages per analysis plugin.  Language identifier.

 I am going to try to get those posted in the next couple of days and
 committed in the next week.  Are there other major improvements we want to
 put in before trying to do a 1.0 release for Nutch?  Thoughts and
 suggestions?

 A few recently opened ones that should be easy to fix:

 NUTCH-661  errors when the uri contains space characters
 NUTCH-657  Estonian N-gram profile has wrong name
 NUTCH-652  AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
            fetch interval correctly
 NUTCH-644  RTF parser doesn't compile anymore
 NUTCH-643  ClassCastException in PdfParser on encrypted PDF with empty
            password
 NUTCH-636  Http client plug-in https doesn't work on IBM JRE
 NUTCH-631  MoreIndexingFilter fails with NoSuchElementException
 NUTCH-626  fetcher2 breaks out the domain with
            db.ignore.external.links set at cross domain redirects
 NUTCH-566  Sun's URL class has bug in creation of relative query URLs
 NUTCH-542  Null Pointer Exception on getSummary when segment no longer
            exists
 NUTCH-531  Pages with no ContentType cause a Null Pointer exception

 And of course this one:

 NUTCH-442  Integrate Solr/Nutch


 We should also review all other open issues marked as Blocker / Major,
 especially those with patches, and take some action - either fix them, or
 won't fix 'em, or postpone to the next release (the single Blocker issue
 should be fixed).


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Help needed in Integrating a module

2008-09-28 Thread Doğacan Güney
On Sat, Sep 27, 2008 at 10:32 PM, Nimesh Priyodit [EMAIL PROTECTED] wrote:
 Hi,
 Recently i have developed my own stemmer.
 Can you please tell me how to integrate the module which i wrote, into
 nutch?

Where exactly do you want to integrate it? Into indexing?

 Regards,
 Nimesh





-- 
Doğacan Güney


Re: Crawled documents in readable format

2008-09-28 Thread Doğacan Güney
On Sat, Sep 27, 2008 at 9:24 PM, Allan Avendaño
[EMAIL PROTECTED] wrote:

 Hi to all!

 I would like to get the nutch crawled documents within readable format,
 How could I do that?


You can try the readseg tool, e.g.

bin/nutch readseg -dump <segment> <output> -nofetch -noparse
-nogenerate -noparsedata -nocontent

This will give you the parsed text of the segments.

 Thanks for ur help


 --
 

 Allan Roberto Avendaño Sudario
 Guayaquil-Ecuador
 Home   :   +593(4) 2800 692
 Office :   +593(4) 2269 268
 + MSN-Messenger: [EMAIL PROTECTED]
 + Gmail: [EMAIL PROTECTED]





-- 
Doğacan Güney


Re: Droids crawler

2008-09-20 Thread Doğacan Güney
On Fri, Sep 12, 2008 at 5:38 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
 Interesting.  Worth a deeper look I think.  I think one of the keys to a new
 version of nutch would be crawler extensibility.


I agree. So let's start a discussion then. What is missing from nutch's crawler?
What does droids do that we don't?

 Dennis

 Andrzej Bialecki wrote:

 Hi all,

  In the light of the discussion about the future of Nutch I'd like to draw your
 attention to Droids - a small crawler framework that uses Spring for
 extensibility.

 http://people.apache.org/~thorsten/droids/

 Are there any lessons there that we could learn?





-- 
Doğacan Güney


Re: Next release?

2008-02-20 Thread Doğacan Güney
Hi,

On Tue, Feb 19, 2008 at 11:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Hi all,

 I propose to start planning for the next release, and tentatively I
 propose to schedule it for the beginning of April.

 I'm going to close a lot of old and outdated issues in JIRA - other
 committers, please do the same if you know that a given issue no longer
 applies.


There are some issues I want to put in before a release. Most are trivial
but I would like to draw attention to NUTCH-442, as it is an issue that I
(and looking at its votes, others) want to see resolved before another
release. I really could use some review and suggestions there (well, I guess
I am partly to blame since I failed to update the patch after Enis's
comments).




 Out of the remaining open issues, we should resolve all with the blocker
 / major status, and of the type bug. Then we can resolve as many as we
 can from the remaining categories, depending on the votes and perceived
 importance of the issue.

 Any other suggestions?

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Backwards compatibility strategy

2007-11-23 Thread Doğacan Güney
Hi,

On Nov 22, 2007 7:45 PM, Sami Siren [EMAIL PROTECTED] wrote:
 Hello all,

 Currently there are many places in Nutch that try to handle older
 formats of serialized data. This (at least in the longer run) will make the
 code harder to understand, harder to test and harder to maintain.

 IMO it would be cleaner to offer conversion with separate tools (like
 CrawlDbConverter) and keep the rest of the code free of such
 functionality. Opinions?

I disagree. Posts on nutch-user show that people are confused when we
break compatibility. If backward-compatibility code within other code
is getting messy, then we can use conversion tools, but they should be
transparent to the regular user. For example, before a nutch job runs, a
small program can check whether any conversion needs to be applied (it
can check compatibility by reading a few records of a segment), print a
warning if so, run the conversion job first, and then run the requested job.
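
Purely as an illustration of that flow (no such classes exist in Nutch; the
names and the version-number scheme are made up):

// Hypothetical sketch of the "transparent conversion" flow described above;
// the class and interface names only outline the idea.
public class CompatibilityGuard {

  interface VersionProbe {
    /** e.g. read a few records of a segment/crawldb and report its format version. */
    int detectVersion(String dataDir) throws Exception;
  }

  interface Converter {
    /** e.g. run an existing conversion job (like CrawlDbConverter) over dataDir. */
    void convert(String dataDir) throws Exception;
  }

  private final VersionProbe probe;
  private final Converter converter;
  private final int currentVersion;

  public CompatibilityGuard(VersionProbe probe, Converter converter, int currentVersion) {
    this.probe = probe;
    this.converter = converter;
    this.currentVersion = currentVersion;
  }

  /** Called automatically before the requested job runs. */
  public void ensureCompatible(String dataDir) throws Exception {
    int found = probe.detectVersion(dataDir);
    if (found < currentVersion) {
      System.err.println("WARN: " + dataDir + " uses format version " + found
          + "; converting to version " + currentVersion + " before running the job");
      converter.convert(dataDir);
    }
  }
}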


 I personally favor starting from scratch when switching version but
 probably there are users who wish to convert older data or are there?

 --
  Sami Siren




-- 
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #261

2007-11-09 Thread Doğacan Güney
] symbol  : constructor 
 Outlink(java.lang.String,java.lang.String,org.apache.hadoop.conf.Configuration)
 [javac] location: class org.apache.nutch.parse.Outlink
 [javac]   outlinks[i] = new Outlink(http://outlink.com/; + i, 
 Outlink + i, conf);
 [javac] ^
 [javac] 
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/test/org/apache/nutch/util/TestFibonacciHeap.java
  :65: cannot find symbol
 [javac] symbol  : class FibonacciHeap
 [javac] location: class org.apache.nutch.util.TestFibonacciHeap
 [javac] FibonacciHeap h= new FibonacciHeap();
 [javac] ^
 [javac] 
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/test/org/apache/nutch/util/TestFibonacciHeap.java
  :65: cannot find symbol
 [javac] symbol  : class FibonacciHeap
 [javac] location: class org.apache.nutch.util.TestFibonacciHeap
 [javac] FibonacciHeap h= new FibonacciHeap();
 [javac]  ^
 [javac] Note: 
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
   uses or overrides a deprecated API.
 [javac] Note: Recompile with -Xlint:deprecation for details.
 [javac] Note: Some input files use unchecked or unsafe operations.
 [javac] Note: Recompile with -Xlint:unchecked for details.
 [javac] 5 errors

 BUILD FAILED
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build.xml
  :217: Compile failed; see the compiler error output for details.

 Total time: 52 seconds
 Publishing Javadoc
 Recording test results
 No test report files were found. Configuration error?
 Updating NUTCH-548
 Updating NUTCH-494
 Updating NUTCH-547
 Updating NUTCH-538





-- 
Doğacan Güney


Re: JIRA emails and Nutch

2007-11-05 Thread Doğacan Güney
On Nov 4, 2007 8:36 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Dennis Kubes wrote:
  I don't think JIRA emails are being sent out for Nutch.  Changes from
  yesterday and today have yet to be mailed out.  Commit emails are being
  mailed.  Is this something that we send to infrastructure?

 I think so. Speaking of which, I noticed that I also stopped getting
 commit messages, which is double strange ... I'll try to subscribe
 manually and see what happens.

Any progress on this? I was thinking of committing/resolving some
issues in JIRA but I want to wait until emails start working.



 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Next move with JIRA ticket

2007-10-31 Thread Doğacan Güney
Hi,

On 10/31/07, Ned Rockson [EMAIL PROTECTED] wrote:
 I submitted a JIRA ticket regarding URL ordering in Generator.java as
 well as a patch (NUTCH-570) and I'm wondering what else I need to do to
 get this committed.  Obviously it's low priority so I may be getting too
 antsy.


Since NUTCH-570 tracks a non-trivial change and nutch development is
a bit slow these days, it may be a while before someone can review
your patch and comment on it. Personally, I have been meaning
to take a look at your patch, but I have been too lazy^Wbusy lately.

What you can do, for example, is send some statistics
regarding the overhead of running the two extra jobs, or the fetch
performance increase that results from smarter url ordering. Again,
personally, I find that patches with such numbers and test cases are
a lot easier to review (and thus easier to commit :)).

-- 
Doğacan Güney


Re: Adding new class to nutch

2007-10-29 Thread Doğacan Güney
Hi,

On 10/29/07, eyal edri [EMAIL PROTECTED] wrote:
 Hi,

 I'm interested in adding a new class of my own to nutch, to allow some
 configuration needed by our application (such as reading a config file, etc.).
 I've written a new java class called LabConf.java and placed it in the
 $NUTCH_HOME/src/java/org/apache/nutch/util dir.

 After running ant I didn't see any messages indicating that this code was
 added to the project.
 Can anyone tell me where I need to tell nutch to pick up this new class?

The new class should be compiled automatically. By default, ant compiles
everything under src/java/org/apache/nutch.


 thanks

 --
 Eyal Edri



-- 
Doğacan Güney


Re: First Plugin

2007-10-05 Thread Doğacan Güney
Hi,

On 10/5/07, Sagar Vibhute [EMAIL PROTECTED] wrote:
 Hi,

 I have recently downloaded and used nutch and I need to develop a few
 plugins for my work. I took the plugin example given on the wiki,

 http://wiki.apache.org/nutch/WritingPluginExample-0%2e9

 and followed the instructions as given there. Now when I start crawling
 again it aborts and throws the following exception:

 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

 I could crawl successfully before I added this plugin.

 Please give any insights you can to get this fixed.

(I really should add this to the FAQ.)

This log doesn't help us; it simply tells us that crawling has failed.
You have to check your logs elsewhere (the logs/hadoop.log file if you
are running locally, and your tasktrackers' logs if you are running in
distributed mode). If you can send those logs we can make a more
informed analysis of your problem.


 Thank You!

 - Sagar



-- 
Doğacan Güney


Re: First Plugin

2007-10-05 Thread Doğacan Güney
.

OK, it seems you have removed the scoring-opic plugin (and any other
scoring plugins you had) by accident. You should check the
plugin.includes option in your nutch-site.xml; there is probably
something wrong with its value. Perhaps you put a newline in it?


 - Sagar



-- 
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #221

2007-09-29 Thread Doğacan Güney
 org.apache.nutch.metadata.TestMetadata
 [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.364 sec
 [junit] Running org.apache.nutch.metadata.TestSpellCheckedMetadata
 [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 13.652 sec
 [junit] Running org.apache.nutch.net.TestURLFilters
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.91 sec
 [junit] Running org.apache.nutch.net.TestURLNormalizers
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.909 sec
 [junit] Running org.apache.nutch.ontology.TestOntologyFactory
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.262 sec
 [junit] Running org.apache.nutch.parse.TestOutlinkExtractor
 [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.972 sec
 [junit] Running org.apache.nutch.parse.TestParseData
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.393 sec
 [junit] Running org.apache.nutch.parse.TestParseText
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.348 sec
 [junit] Running org.apache.nutch.parse.TestParserFactory
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.376 sec
 [junit] Running org.apache.nutch.plugin.TestPluginSystem
 [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 2.486 sec
 [junit] Running org.apache.nutch.protocol.TestContent
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.602 sec
 [junit] Running org.apache.nutch.protocol.TestProtocolFactory
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.241 sec
 [junit] Running org.apache.nutch.searcher.TestHitDetails
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.297 sec
 [junit] Running org.apache.nutch.searcher.TestOpenSearchServlet
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.328 sec
 [junit] Running org.apache.nutch.searcher.TestQuery
 [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.758 sec
 [junit] Running org.apache.nutch.searcher.TestSummarizerFactory
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.13 sec
 [junit] Running org.apache.nutch.searcher.TestSummary
 [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.446 sec
 [junit] Running org.apache.nutch.util.TestEncodingDetector
 [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.667 sec
 [junit] Test org.apache.nutch.util.TestEncodingDetector FAILED

Heh, another Java 5/Java 6 problem (Charset.isSupported("utf-32") is
false for Java 5 and true for Java 6). I have made another commit;
hopefully everything will be OK this time.
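
For reference, a minimal stand-alone check illustrating the difference
(an illustrative snippet, not the actual Nutch test code):

import java.nio.charset.Charset;

// Prints false on Java 5 and true on Java 6: the UTF-32 charsets were
// only added to the standard JDK charsets in Java 6.
public class CharsetCheck {
  public static void main(String[] args) {
    System.out.println(Charset.isSupported("utf-32"));
  }
}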

 [junit] Running org.apache.nutch.util.TestFibonacciHeap
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.115 sec
 [junit] Running org.apache.nutch.util.TestGZIPUtils
 [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.293 sec
 [junit] Running org.apache.nutch.util.TestNodeWalker
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.84 sec
 [junit] Running org.apache.nutch.util.TestPrefixStringMatcher
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.364 sec
 [junit] Running org.apache.nutch.util.TestStringUtil
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.231 sec
 [junit] Running org.apache.nutch.util.TestSuffixStringMatcher
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.357 sec
 [junit] Running org.apache.nutch.util.TestURLUtil
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.39 sec
 [junit] Running org.apache.nutch.util.mime.TestMimeType
 [junit] Tests run: 17, Failures: 0, Errors: 0, Time elapsed: 0.246 sec
 [junit] Running org.apache.nutch.util.mime.TestMimeTypes
 [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 1.695 sec

 BUILD FAILED
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build.xml:297: Tests failed!

 Total time: 4 minutes 12 seconds
 Publishing Javadoc
 Recording test results




-- 
Doğacan Güney


Re: Scoring API issues (LONG)

2007-09-19 Thread Doğacan Güney
On 9/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

  public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration
  config);
  public void prepareGeneratorConfig(Path crawlDb, Configuration config);
  public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[]
  segments, Configuration config);
  public void prepareUpdateConfig(Path crawlDb, Path[] segments,
  Configuration config);
 
  Should we really pass Path-s to methods? IMHO, opening a file and
  reading from it looks a bit cumbersome. I would suggest that the
  relevant job would read the file then pass the data (MapWritable) to
  the method. For example, prepareGeneratorConfig would look like this:
 
  public void prepareGeneratorConfig(MapWritable crawlDbMeta,
  Configuration config);

 What about the segment's metadata in prepareUpdateConfig? Following your
 idea, we would have to pass a Map<String segmentName, MapWritable
 metaData> ...

Yeah, I think it looks good but I guess you disagree?


 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Host-level stats, ranking and recrawl

2007-09-18 Thread Doğacan Güney
Hi,

On 9/17/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Hi,

 I was recently reading again some scoring-related papers, and found some
 interesting data in a paper by Baeza-Yates et al, Crawling a Country:
 Better Strategies than Breadth-First for Web Page Ordering
 (http://citeseer.ist.psu.edu/730674.html).

 This paper compares various strategies for prioritizing a crawl of
 unfetched pages. Among others, it compared the OPIC scoring and a simple
 strategy which is called large sites first. This strategy prioritizes
 pages from large sites and deprioritizes pages from small / medium
 sites. In order to measure the effectiveness the authors used the value
 of accumulated PageRank vs. the percentage of crawled pages - the
 strategy that ensures quick ramp-up of aggregate pagerank is the best.

 A bit surprisingly, they found that large-sites-first wins over OPIC:

 Breadth-first is close to the best strategies for the first 20-30% of
 pages, but after that it becomes less efficient.
   The strategies batch-pagerank, larger-sites-first and OPIC have better
 performance than the other strategies, with an advantage towards
 larger-sites-first when the desired coverage is high. These strategies
 can retrieve about half of the Pagerank value of their domains
 downloading only around 20-30% of the pages.

 Nutch currently uses OPIC-like scoring for this, so most likely it
 suffers from the same symptoms (the authors also mention a relatively
 poor OPIC performance at the beginning of a crawl).

 Nutch doesn't collect at the moment any host-level statistics, so we
 couldn't use the other strategy even if we wanted.

 What if we added a host-level DB to Nutch? Arguments against this: it's
 an additional data structure to maintain, and this adds complexity to
 the system; it's an additional step in the workflow (- it takes longer
 time to complete one cycle of crawling). Arguments for are the
 following: we could implement the above scoring method ;), plus the
 host-level statistics are good for detecting spam sites, limiting the
 crawl by site size, etc.

Another +1. We definitely need domain-level statistics anyway, so
being able to implement large-sites-first is a nice bonus, I think :)


 We could start by implementing a tool to collect such statistics from
 CrawlDb - this should be a trivial map-reduce job, so if anyone wants to
 take a crack at this it would be a good exercise ... ;)

 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney
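
A rough sketch of the CrawlDb host-statistics job suggested above,
written against Hadoop's old org.apache.hadoop.mapred API for
concreteness; the class name, the output handling and the choice to
key the counts by host are illustrative assumptions, not actual Nutch
code:

import java.io.IOException;
import java.net.URL;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.nutch.crawl.CrawlDatum;

public class HostStats {

  /** Emits (host, 1) for every <url, CrawlDatum> entry in the CrawlDb. */
  public static class HostMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(Text url, CrawlDatum datum,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      try {
        // key the count by the URL's host name
        output.collect(new Text(new URL(url.toString()).getHost()), ONE);
      } catch (Exception e) {
        // skip malformed URLs
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(HostStats.class);
    job.setJobName("crawldb host stats");
    // args[0] is the crawldb directory, args[1] the output directory
    FileInputFormat.addInputPath(job, new Path(args[0], "current"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(HostMapper.class);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    JobClient.runJob(job);
  }
}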


Re: Scoring API issues (LONG)

2007-09-18 Thread Doğacan Güney
 segment. Each operation that changes
 a db or a segment would update this information.

 In practical terms, I propose to add static methods to CrawlDbReader,
 LinkDbReader and SegmentReader, which can retrieve and / or update this
 information.

 3. Initialization of scoring plugins with global information
 
 Current scoring API works only with local properties of the page (I'm
 not taking into account plugins that use external information sources -
 that's outside of the scope of the API). It doesn't have any built-in
 facilities to collect and calculate global properties useful for PR or
 HITS calculation, such as e.g. the number of dangling nodes (ie. pages
 without outlinks), their total score, the number of inlinks, etc. It
 doesn't have the facility to output this collected global information at
 the end of the job. Neither has it any facility to initialize scoring
 plugins with such information if one exists.

 I propose to add the following methods to scoring plugins, so that they
 can modify the job configuration right before the job is started, so
 that later on the plugins could use this information when scoring
 filters are initialized in each task. E.g:

 public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration
 config);
 public void prepareGeneratorConfig(Path crawlDb, Configuration config);
 public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[]
 segments, Configuration config);
 public void prepareUpdateConfig(Path crawlDb, Path[] segments,
 Configuration config);

Should we really pass Path-s to methods? IMHO, opening a file and
reading from it looks a bit cumbersome. I would suggest that the
relevant job would read the file then pass the data (MapWritable) to
the method. For example, prepareGeneratorConfig would look like this:

public void prepareGeneratorConfig(MapWritable crawlDbMeta,
Configuration config);


 Example: to properly implement the OPIC scoring, it's necessary to
 collect the total number of dangling nodes, and the total score from
 these nodes. Then, in the next step it's necessary to spread this total
 score evenly among all other nodes in the crawldb. Currently this is not
 possible unless we run additional jobs, and create additional files to
 keep this data around between the steps. It would be more convenient to
 keep this data in CrawlDb metadata (see above) and make relevant values
 available in the job context (Configuration).


 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #203

2007-09-11 Thread Doğacan Güney
]^
 [javac] 3 errors

 BUILD FAILED

Hmm, I can compile Nutch successfully with Java 6 but not with Java 5.
Is there an @Override annotation change between Java 5 and Java 6?

 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build.xml:112: The following error occurred while executing this line:
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/build.xml:76: The following error occurred while executing this line:
 http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/build-plugin.xml:111: Compile failed; see the compiler error output for details.

 Total time: 47 seconds
 Publishing Javadoc
 Recording test results
 No test report files were found. Configuration error?
 Updating NUTCH-550
 Updating NUTCH-546




-- 
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #203

2007-09-11 Thread Doğacan Güney
On 9/11/07, Susam Pal [EMAIL PROTECTED] wrote:
 Is it that the interface 'org.apache.nutch.net.URLFilter' was compiled
 with JDK 1.5 earlier? I have seen this problem happening with a beta
 version of JDK 1.6.

No, it still happens after an ant clean; ant. The problem seems to be
that Java 5 rejects @Override annotations on methods that implement an
interface method, while Java 6 is OK with them. Both are OK with
@Override on methods that override a superclass method.
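
A minimal illustration of the difference (illustrative only, not Nutch
code):

// With a Java 5 compiler the @Override below is rejected, because run()
// implements an interface method rather than overriding a superclass
// method; a Java 6 compiler accepts it.
public class OverrideDemo implements Runnable {
  @Override
  public void run() {
    System.out.println("running");
  }
}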


 Are you using the latest version, JDK 1.6 Update 2?

$ java -version
java version 1.6.0_02
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
Java HotSpot(TM) Client VM (build 1.6.0_02-b05, mixed mode, sharing)

Anyway, I am going to commit a small fix that removes the @Override
annotations so that the code compiles under both versions.


 Regards,
 Susam Pal
 http://susam.in/

 On 9/11/07, Doğacan Güney [EMAIL PROTECTED] wrote:
  On 9/11/07, [EMAIL PROTECTED]
  [EMAIL PROTECTED] wrote:
   See 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/203/changes
  
   Changes:
  
   [dogacan] NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1.
  
   [dogacan] NUTCH-546 - file URL are filtered out by the crawler.
  
   --
   [...truncated 4410 lines...]
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/classes
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/test
  
   init-plugin:
  
   deps-jar:
  
   compile:
[echo] Compiling plugin: tld
   [javac] Compiling 2 source files to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/classes
  
   jar:
 [jar] Building jar: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/tld.jar
  
   deps-test:
  
   deploy:
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/tld
[copy] Copying 1 file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/tld
  
   copy-generated-lib:
[copy] Copying 1 file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/tld
   Overriding previous definition of reference to plugin.deps
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/test/data
[copy] Copying 6 files to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/test/data
  
   init:
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/classes
   Overriding previous definition of reference to plugin.deps
  
   init-plugin:
  
   deps-jar:
  
   init:
  
   init-plugin:
  
   deps-jar:
  
   compile:
[echo] Compiling plugin: lib-regex-filter
  
   jar:
  
   init:
  
   init-plugin:
  
   deps-jar:
  
   compile:
[echo] Compiling plugin: lib-regex-filter
  
   compile-test:
   [javac] Compiling 1 source file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/lib-regex-filter/test
   [javac] Note: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
 uses unchecked or unsafe operations.
   [javac] Note: Recompile with -Xlint:unchecked for details.
  
   compile:
[echo] Compiling plugin: urlfilter-automaton
   [javac] Compiling 1 source file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/classes
  
   jar:
 [jar] Building jar: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/urlfilter-automaton.jar
  
   deps-test:
  
   init:
  
   init-plugin:
  
   deps-jar:
  
   compile:
[echo] Compiling plugin: lib-regex-filter
  
   jar:
  
   deps-test:
  
   deploy:
  
   copy-generated-lib:
  
   deploy:
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton
[copy] Copying 1 file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton
   Overriding previous definition of reference to plugin.deps
  
   copy-generated-lib:
[copy] Copying 1 file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton
[copy] Copying 1 file to 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton
  
   init:
   [mkdir] Created dir: 
   http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk

Re: Limiting outlink tags.

2007-09-07 Thread Doğacan Güney
Hi Marcin,

On 9/7/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:
 Hi,
 I have noticed that Nutch considers img/@src as an outlink. I suppose in many 
 cases people do not want to treat an image as an outlink. At least I don't 
 want. The same case is with script/@src. But, it seems there is no way to 
 limit outlink tags. The DOMContentUtils.getOutlinks() takes links from all 
 a,area,form,frame,iframe,script,link,img. Only form element can be turned 
 off by parser.html.form.use_action parameter.

 I would suggest to introduce a new configuration parameter which could be 
 used to turn on or off certain elements. It could be simply done by single 
 parameter, which would contain coma separated list of tags to be turned off.

 What is your opinion? If you think it is a valid issue I can make a patch for 
 this.

There is already NUTCH-488 open for this (with a patch). Feel free to
add comments/patches/etc. there. Btw, I agree that a single
comma-separated list is better than a new configuration parameter for
every tag.


 Regards,
 Marcin




-- 
Doğacan Güney


Re: bug with generate performance

2007-09-07 Thread Doğacan Güney
Hi,

On 8/31/07, misc [EMAIL PROTECTED] wrote:

 Hello-

 I am almost certain I have found a nasty bug with nutch generate.

 Problem: Nutch generate can take many hours, even a day to complete (on a 
 crawldb that has less than 2 million urls).

 I added debug code to Generator-Selector.map to see when map is called 
 and returns, and observed interesting behavior, described here:

 1. Most of the time, when generate is run urls are processed in chunky 
 batches, usually about 40 at a time, followed by a 1 second delay.  I timed 
 the delay, and it really is a 1 second delay (ie- 30 batches was 30 seconds.) 
  When this happens it takes hours to complete.

 2. Sometimes (randomly as far as I can tell) when I run nutch, the urls 
 are processed without delays.  It is an all or nothing event, either I run 
 and all urls process quickly without delay (in minutes), or more likely I get 
 the chunky processing with many 1 second delays and the program takes hours 
 to end.  The one exception is

 3. When the processing runs quickly I've seen the main thread end (I have 
 some profiling going, so I know when a thread ends), and then more likely 
 than not a second thread begins where the first starts, chunky like usual.  
 Although I sometimes can get fast processing in one thread, it is almost 
 impossible for me te get it in all threads and therefore general processing 
 is very slow (hours).

 4. I tried to put in more debug code to find the line where the delays 
 occured, but the last line printed to the log at a delay seemed random, 
 leading me to believe that the log is not being flushed uniformly.

 5. The profiler I used seemed to imply that about 100% of the time was 
  spent in java.lang.Thread.sleep.  I am not completely familiar with the 
  profiler I used so I am not completely sure I interpreted this correctly.

 I will keep debugging here, but perhaps someone here has some insight 
 into what might be happening?

Others have also reported a problem with generate performance. It
seems we have a problem here, but I cannot reproduce this behaviour,
so I am not sure what causes it. Can you open a JIRA issue and enter
your comments there? Also, describing how you are running generate
will be very helpful (what is generate.max.per.host set to? what is
the -topN argument? etc.)


 thanks
 -J


-- 
Doğacan Güney


Re: ant test failures

2007-09-01 Thread Doğacan Güney
On 8/31/07, Christopher Bader [EMAIL PROTECTED] wrote:
 Hi,



 I'm new to this list.



 I checked out the whole nutch source tree yesterday, then I ran ant and ant
 test in the trunk and the three branches.



 Ant succeeded in all four cases, but ant test succeeded only on the 0.7
 branch.  In other words, ant test failed on the trunk and on the 0.8 and
 0.9 branches.


One of the plugins fails with Java 1.6 (I think it is parse-swf, but I
am not sure). This is a known bug. The tests should pass with Java 1.5.



 Is this the expected result?  Or am I doing something wrong?  I'm running
 Java 1.6 and Ant 1.7.



 CB








-- 
Doğacan Güney


Re: Redirects and alias handling (LONG)

2007-08-21 Thread Doğacan Güney
.
 -
 This issue has been briefly discussed in NUTCH-353. Inlink information
 should be merged so that all link information from all aliases is
 aggregated, so that it points to a selected canonical target URL.

We should also merge their scores. If example.com (with score 4.0) is
an alias for www.example.com (with score 8.0), the selected URL (which,
as I said before, I think should be www.example.com) should end up
with a score of 12.0. We may not want to do this for aliases in
different domains, but I think we should definitely do it if two URLs
with the same content are under the same domain (like example.com).


 See also above sample queries from Google.


 B. Design and implementation
 

 In order to select the correct canonical URL at each stage in
 redirection handling we should keep the accumulated redirection path,
 which includes source URLs and redirection methods (temporary/permanent,
 protocol or content-level redirect, redirect delay). This way, when we
 arrive a the final page in the redirection path, we should be able to
 select the canonical path.

 We should also specify which intermediate URL we accept as the current
 canonical URL in case we haven't yet reached the end of redirections
 (e.g. when we don't follow redirects immediately, but only record them
 to be used in the next cycle).

 We should introduce an alias status in CrawlDb and LinkDb, which
 indicates that a given URL is a non-canonical alias of another URL. In
 CrawlDb, we should copy all accumulated metadata and put it into the
 target canonical CrawlDatum. In LinkDb, we should merge all inlinks
 pointing to non-canonical URLs so that they are assigned to the
 canonical URL. In both cases we should still keep the non-canonical URLs
 in CrawlDb and LinkDb - however we could decide not to keep any of the
 metadata / inlinks there, just an alias flag and a pointer to the
 canonical URL where all aggregated data is stored. CrawlDb and
 LinkDbReader may or may not hide this fact from their users - I think it
 would be more efficient if users of this API would get the final
 aggregated data right away, perhaps with an indicator that it was
 obtained using a non-canonical URL ...

 Regarding Lucene indexes - we could either duplicate all data for each
 non-canonical URL, i.e. create as many full-blown Lucene documents as
 many there are aliases, or we could create special redirect documents
 that would point to a URL which contains the full data ...

We can avoid doing both. Let's assume A redirects to B, C also
redirects to B, and B redirects to D. After the fetch/parse/updatedb
cycle that processes D we would probably have enough data to choose
the 'canonical URL' (let's assume the canonical is B). Then, during
the Indexer's reduce, we can simply index the parse text and parse
data (and whatever else) of D under URL B, since we won't index B (or
A or C) as itself (it doesn't have any useful content after all).



 That's it for now ... Any comments or suggestions to the above are welcome!

Andrzej, have you written any code? I would suggest that we open a
JIRA and have some code (no matter how half-baked it is) as soon
as we can.


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Redirects and alias handling (LONG)

2007-08-21 Thread Doğacan Güney
On 8/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

  If the same content is available under multiple urls, I think it makes
  sense to assume that the url with the highest score should be 'the
  representative'  url.

 Not necessarily - it depends how you defined your score.
 http://www.ibm.com/ may actually have a low score, because it
 immediately redirects to http://www.ibm.com/index.html (actually, it
 redirects to http://www.ibm.com/us/index.html).

 Also, the shortest url wins rule is not always true. Let's say I own a
 domain a.biz, and I made a Wikipedia mirror there. Which of the pages is
 more representative: http://a.biz/About_Wikipedia or
 http://www.wikipedia.org/en/About_Wikipedia ?


  3. Link and anchor information for aliases and redirects.
  -
  This issue has been briefly discussed in NUTCH-353. Inlink information
  should be merged so that all link information from all aliases is
  aggregated, so that it points to a selected canonical target URL.
 
   We should also merge their scores. If example.com (with score 4.0) is
   an alias for www.example.com (with score 8.0), the selected URL (which,
   as I said before, I think should be www.example.com) should end up
   with a score of 12.0. We may not want to do this for aliases in
   different domains, but I think we should definitely do it if two URLs
   with the same content are under the same domain (like example.com).

 I think you are right - at least with the OPIC scoring it would work ok.


 
  Regarding Lucene indexes - we could either duplicate all data for each
  non-canonical URL, i.e. create as many full-blown Lucene documents as
  many there are aliases, or we could create special redirect documents
  that would point to a URL which contains the full data ...
 
   We can avoid doing both. Let's assume A redirects to B, C also
   redirects to B, and B redirects to D. After the fetch/parse/updatedb
   cycle that processes D we would probably have enough data to choose
   the 'canonical URL' (let's assume the canonical is B). Then, during
   the Indexer's reduce, we can simply index the parse text and parse
   data (and whatever else) of D under URL B, since we won't index B (or
   A or C) as itself (it doesn't have any useful content after all).

 Hmm. The index should somehow contain _all_ urls, which point to the
 same document. I.e. when you search for url "http://example.com" it
 should ideally return exactly the same Lucene document as when you
 search for "http://www.example.com/index.html".

Why would you do a search with the full name of the url? I also don't
understand why we need to have all urls in index (we already eliminate
near-duplicates with dedup).  I guess I am missing your use case
here...


 Similarly, the inlink information for all aliased urls should be the
 same (but in our case it's not a Lucene issue, only the LinkDb aliasing
 issue).

I agree with you here.



 
 
  That's it for now ... Any comments or suggestions to the above are welcome!
 
   Andrzej, have you written any code? I would suggest that we open a
   JIRA and have some code (no matter how half-baked it is) as soon
   as we can.

 Not yet - I'll open the issue and put these initial thoughts there.


 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Redirects and alias handling (LONG)

2007-08-21 Thread Doğacan Güney
On 8/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

   Hmm. The index should somehow contain _all_ urls, which point to the
   same document. I.e. when you search for url "http://example.com" it
   should ideally return exactly the same Lucene document as when you
   search for "http://www.example.com/index.html".
 
  Why would you do a search with the full name of the url? I also don't
  understand why we need to have all urls in index (we already eliminate
  near-duplicates with dedup).  I guess I am missing your use case
  here...

 Let's say I'm searching for test and I want to limit the search to a
 particular url. I enter a query:

 test url:example.com

 It should yield the same results as for the following query:

 test url:www.example.com

 (assuming they are aliases).


I guess we can do something like this (continuing from my example
above): index D's data under B, then add an alias field to the Lucene
document listing A, C and D. Then change query-url so that a url:
query also searches the alias field.
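
A rough sketch of what that could look like with the Lucene 2.x Field
API; the field names and the helper class are hypothetical, not
existing Nutch code:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** Hypothetical sketch: build the Lucene document for canonical URL B,
 *  carrying D's parse text plus an "alias" field listing A, C and D. */
public class AliasFieldSketch {

  public static Document buildDoc(String canonicalUrl, String parseText,
                                  String[] aliases) {
    Document doc = new Document();
    doc.add(new Field("url", canonicalUrl, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("content", parseText, Field.Store.NO, Field.Index.TOKENIZED));
    for (String alias : aliases) {
      // query-url would be changed so that a url: clause also hits this field
      doc.add(new Field("alias", alias, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    return doc;
  }
}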


 Another, more realistic example: I'm searching for IBM products. So I
 enter a query:

 products site:ibm.com

 This should yield the same results as any of the following:

 products site:www.ibm.com
 products site:www-128.ibm.com
 products site:www-304.ibm.com


Thanks for the explanation.

How do we know that the www.ibm.com and www-128.ibm.com hosts are
perfect mirrors of one another? All we can know is that the
http://www.ibm.com/ and http://www-128.ibm.com/ *urls* are aliases of
one another, and that for the URLs we have fetched *so far* the hosts
seem to mirror each other. It is possible that the next URL we fetch
from one of those sites does not exist on the other. I don't think we
can ever be certain that they are perfect mirrors of each other, so,
IMHO, we shouldn't treat those queries as the same. Google also
doesn't return the same results for products site:www.ibm.com and
products site:www-128.ibm.com.

(One small unrelated note: As discussed in NUTCH-439 and NUTCH-445, we
should treat site:ibm.com as all hosts under domain ibm.com even if
http://www.ibm.com/ and http://ibm.com/ are perfect mirrors of each
other.)


 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Doğacan Güney


Re: Is there any chance that my patches will be considered?

2007-08-08 Thread Doğacan Güney
On 8/8/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:
 Hello Nutch Developers,
 On May 22 I have contributed two patches - NUTCH-487 and NUTCH-490. The next 
 release is probably coming soon. I would be really pleased if they are merged 
 by then I don't have to patch it. Is there any chance they will be merged 
 into source code? I can port them to current head, but so far nobody asked 
 for this.

NUTCH-487 is mostly a duplicate of NUTCH-369. I merged patches from
both issues in NUTCH-25 since NUTCH-25 needs those patches to work
correctly (and I gave you and Renaud Richardet credit in comments).

As for NUTCH-490, I haven't taken an in-depth look at it, but I don't
see the point of it. Why not just use HtmlParseFilters, since you have
access to the DOM object? What advantage do neko filters have? Also,
having an extension point for a library that is only possibly used by
a possibly used plugin looks really wrong from a design point of view.


 I would also like to draw your attention to one point. It has already been two
 and a half months since I added the patches, and there is not even a single
 comment on them. This is really discouraging for me as a contributor. I know
 that merging patches is not the thing that developers love to do, but you are
 the only ones who can do it. Of course I don't mean you should thank every
 contribution, but please take it into account. Having someone's work ignored,
 and that is how it looks to me, really discourages further work. Reviewing it
 and saying you won't merge it for some reason would be much better than
 leaving it without a single comment. This may reduce your active community.

 Think of this.

 Best regards,
 Marcin Okraszewski



-- 
Doğacan Güney


Re: [jira] Commented: (NUTCH-527) MapWritable doesn't support all hadoops writable types

2007-07-25 Thread Doğacan Güney

On 7/25/07, Robert Young [EMAIL PROTECTED] wrote:

The message which was appearing in the logs is pasted below.

Basically, in org.apache.nutch.crawl.MapWritable#getKeyValueEntry the
Writable is instantiated. Its class is determined by a two-byte code
(which is written to the crawldb, I guess); if there is no entry for
the class, it fails to create it, regardless of whether it's a
Writable. You're right that it can potentially handle any Writable
object, but only if it has a mapping for its class.


If you add a Writable that does not have a mapping, MapWritable
automatically creates one and stores the mapping internally. When a
MapWritable is written, any new mapping is also written (as a byte and
the corresponding class name). So, when you read a MapWritable, it
first reads all the mappings (those that are not already statically
defined) and then proceeds to read the <Writable, Writable> map. So I
think your problem is caused by something else (perhaps there is a bug
in MapWritable's implementation, but that is what the code is trying
to do).
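
A toy sketch of the bookkeeping described above (purely illustrative;
the real MapWritable code in Nutch differs in its details):

import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Toy illustration: a Writable class with no static mapping gets a fresh
 *  one-byte id, and the (id, className) pairs are written out before the
 *  map entries that refer to them. */
public class ClassIdSketch {
  private final Map<Class<?>, Byte> classToId = new HashMap<Class<?>, Byte>();
  private byte nextId = 1;

  byte idFor(Class<?> clazz) {
    Byte id = classToId.get(clazz);
    if (id == null) {            // first time this class is seen:
      id = nextId++;             // allocate a new one-byte id
      classToId.put(clazz, id);  // and remember the mapping
    }
    return id;
  }

  void writeIdMappings(DataOutput out) throws IOException {
    out.writeByte(classToId.size());
    for (Map.Entry<Class<?>, Byte> e : classToId.entrySet()) {
      out.writeByte(e.getValue());         // the id ...
      out.writeUTF(e.getKey().getName());  // ... and the class it stands for
    }
  }
}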

Also, replying to JIRA generated emails does not add comments to
issues (despite what the email is saying). So please use JIRA to
reply.



Cheers
Rob

07/07/25 11:52:00 WARN crawl.MapWritable: Unable to load meta data
entry, ignoring.. : java.io.IOException: unable to load class for id:
36

On 7/25/07, Doğacan Güney (JIRA) [EMAIL PROTECTED] wrote:

 [ 
https://issues.apache.org/jira/browse/NUTCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515283
 ]

 Doğacan Güney commented on NUTCH-527:
 -

 What was the error you were having? MapWritable supports reading and writing
*all* Writables. The ones defined at the top of the file are an optimization and
shouldn't affect correctness (basically, MapWritable first writes a byte and the
associated classname, then writes that byte to indicate the classname everywhere
else. For commonly used types we statically define the association so that the
"first write the byte, then the classname" phase is not necessary).

  MapWritable doesn't support all hadoops writable types
  --
 
  Key: NUTCH-527
  URL: https://issues.apache.org/jira/browse/NUTCH-527
  Project: Nutch
   Issue Type: Bug
 Affects Versions: 0.9.0
  Environment: Tested on Solaris and Windows with Java 1.5
 Reporter: Rob Young
  Attachments: mapwritable.patch
 
 
  The map of classes which implement org.apache.hadoop.io.Writable is not 
complete. It does not, for example, include org.apache.hadoop.io.BooleanWritable. I 
would happily provide a patch if someone would explain what the Byte parameter is.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.






--
Doğacan Güney


Re: OOM error during parsing with nekohtml

2007-07-17 Thread Doğacan Güney

Hi,

On 7/17/07, Shailendra Mudgal [EMAIL PROTECTED] wrote:

Hi all,

Thanks for your suggestions.

I am running parse on a single url
(http://www.fotofinity.com/cgi-bin/homepages.cgi). For other urls, parse
works perfectly. We are getting this error because of the html of the page.
The page contains many anchor tags which are not closed properly, hence the
neko html parser throws this exception. The page can be parsed successfully
using tagsoup. We think this is a bug in the neko html parser.


Since tagsoup works and neko doesn't, I agree with you that this is a
bug in neko.

If you want to skip over this page (the parser will not extract text
from it, but the parse job will run to completion overall), you may
try changing the catch clause at ParseSegment.java:77 from Exception
to Throwable. This should catch the OOM and continue.
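
A tiny self-contained demo of why widening the catch matters (not
Nutch code): an OutOfMemoryError is an Error, not an Exception, so
only a catch (Throwable ...) clause lets the loop move on to the next
document.

public class CatchThrowableDemo {
  public static void main(String[] args) {
    try {
      // simulate one document blowing up the parser with an Error
      throw new OutOfMemoryError("simulated");
    } catch (Throwable t) { // catch (Exception e) would NOT catch this
      System.err.println("skipping document: " + t);
    }
    System.out.println("parsing continues with the next document");
  }
}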




Regards,
Shailendra







On 7/16/07, Tsengtan A Shuy [EMAIL PROTECTED] wrote:

 Thank you for the info.
 The OOM exception in your previous email indicates that your system is
 running out of heap memory.  You either have instantiated too many
 objects,
 or there are memory leaks in the source codes.

 Hope this will help you!
 Cheer!!

 Adam Shuy, President
 ePacific Web Design  Hosting
 Professional Web/Software developer
 TEL: 408-272-6946
 www.epacificweb.com

 -Original Message-
 From: Kai_testing Middleton [mailto:[EMAIL PROTECTED]
 Sent: Monday, July 16, 2007 8:43 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: OOM error during parsing with nekohtml

 You could try looking at these two discussions:
 http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html
 http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html

 --Kai

 - Original Message 
 From: Tsengtan A Shuy [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
 Sent: Monday, July 16, 2007 3:45:59 AM
 Subject: RE: OOM error during parsing with nekohtml

 I successfully ran the whole-web crawl with my new ubuntu OS, and I am
 ready to fix the bug.  I need someone to guide me to get the most updated
 source code and the bug assignment.

 Thank you in advance!!

 Adam Shuy, President
 ePacific Web Design  Hosting
 Professional Web/Software developer
 TEL: 408-272-6946
 www.epacificweb.com
 -Original Message-
 From: Shailendra Mudgal [mailto:[EMAIL PROTECTED]
 Sent: Monday, July 16, 2007 3:05 AM
 To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
 Subject: OOM error during parsing with nekohtml

 Hi All,

 We are getting an OOM Exception during the processing of
 http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied
 Nutch-497 patch to our source code. But actually the error is coming
 during
 the parse method.
 Does anybody has any idea regarding this.  Here is the complete stacktrace
 :

 java.lang.OutOfMemoryError: Java heap space
 at java.lang.String.toUpperCase(String.java:2637)
 at java.lang.String.toUpperCase(String.java:2660)
 at
 org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(
 NamespaceBinder.ja
 va:443)
 at
 org.cyberneko.html.filters.NamespaceBinder.startElement(
 NamespaceBinder.java
 :252)
 at
 org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java
 :100
 9)
 at
 org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
 at
 org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
 at
 org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(
 HTMLScanner.j
 ava:2343)
 at
 org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
 at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
 at
 org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
 at
 org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
 at
 org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java
 :16
 4)
 at
 org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
 at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
 at
 org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
 at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)


 Regards,
 Shailendra










 
 
 Boardwalk for $500? In 2007? Ha! Play Monopoly Here and Now (it's updated
 for today's economy) at Yahoo! Games.
 http://get.games.yahoo.com/proddesc?gamekey=monopolyherenow






--
Doğacan Güney


Re: Nutch nightly build and NUTCH-505 draft patch

2007-07-11 Thread Doğacan Güney

Hi,

On 7/2/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

Recently I successfully applied NUTCH-505_draft_v2.patch as follows:

$ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
$ wget 
https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
 --no-check-certificate
$ sudo patch -p0  NUTCH-505_draft_v2.patch
$ ant clean
$ ant

However, I also needed other recent nutch functionality, so I downloaded a 
nightly build:

$ wget 
http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz

I then attempted to apply the patch to that build using the successive steps.  I was able to run 
ant clean but ant failed with

build.xml:61: Specify at least one source--a file or resource collection

Do I need to get a source checkout of a nightly build?  How would I do that?



Once you check out the Nutch trunk with svn checkout, you can use
svn up to get the latest code changes. You can also use svn st -u,
which compares your local version against trunk and shows you what has
changed.





Pinpoint customers who are looking for what you sell.
http://searchmarketing.yahoo.com/



--
Doğacan Güney


Re: OPIC scoring differences

2007-07-11 Thread Doğacan Güney

On 7/9/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Carl Cerecke wrote:
 Hi,

 The docs for the OPICScoringFilter mention that the plugin implements a
 variant of OPIC from Abiteboul et al's paper. What exactly is different?
 How does the difference affect the scores?

As it is now, the implementation doesn't preserve the total cash value
in the system, and also there is almost no smoothing between the
iterations (Abiteboul's history).

As a consequence, scores may (and do) vary dramatically between
iterations, and they don't converge to stable values, i.e. they always
increase. For pages that get a lot of score contributions from other
pages this leads to an explosive increase into the range of thousands or
eventually millions. This means that the scores produced by the OPIC
plugin exaggerate score differences between pages more and more, even if
the web graph that you crawl is stable.

In a sense, to follow the cash analogy, our implementation of OPIC
illustrates a runaway economy - galloping inflation, rich get richer and
poor get poorer ;)


 Also, there's a comment in the code:

 // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
 // XXX in the paper, where page loses its score if it's distributed to
 // XXX linked pages...

 Is this something that will be looked at eventually or is the scoring
 good enough at the moment without some adjustment.

Yes, I'll start working on it when I get back from vacations. I did some
simulations that show how to fix it (see
http://wiki.apache.org/nutch/FixingOpicScoring bottom of the page).


Andrzej, nice to see you working on this.

There is one thing that I don't understand about your presentation.
Assume that page A is the only url in our crawldb and it contains n
outlinks.

t = 0 - Generate runs, A is generated.

t = 1 - Page A is fetched and its cash is distributed to its outlinks.

t = 2 - Generate runs, pages P0-Pn are generated.

t = 3 - P0-Pn are fetched and their cash is distributed to their outlinks.
- At this time, it is possible that page Pk links to page A.
So now page A's cash > 0.

t = 4 - Generate runs, page A is considered but is not generated
(since its next fetch time is later than the current time).
- Won't page A become a temporary sink? The time between
subsequent fetches may be as large as 30 days in the default
configuration. So page A will accumulate cash for a long time without
distributing it.
- I don't see how we can achieve this but, IMO, if a page is
considered but not generated, Nutch should distribute its cash to the
outlinks that are stored in its parse data. (I know that this is
incredibly hard, if not impossible, to do.)

Or am I missing something here?
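
As a side note, a toy illustration of the "total cash value is not
preserved" point quoted above (numbers made up; this is not the actual
plugin code): page P keeps its own score while also handing a share to
each outlink, so every fetch inflates the total.

public class OpicInflationDemo {
  public static void main(String[] args) {
    double pageScore = 1.0;        // P's cash before the fetch
    int outlinkCount = 2;
    double perOutlink = pageScore / outlinkCount;

    double outlinkA = perOutlink;  // each outlink receives 0.5 ...
    double outlinkB = perOutlink;
    // ... but P keeps its own score instead of giving it away
    // (the "no adjustment" noted in the code comment above).
    double totalCash = pageScore + outlinkA + outlinkB;

    System.out.println("total cash after one fetch: " + totalCash); // 2.0, was 1.0
  }
}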



--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Doğacan Güney


Re: OPIC scoring differences

2007-07-09 Thread Doğacan Güney

Hi,

On 7/9/07, Carl Cerecke [EMAIL PROTECTED] wrote:

Hi,

The docs for the OPICScoringFilter mention that the plugin implements a
variant of OPIC from Abiteboul et al's paper. What exactly is different?
How does the difference affect the scores?

Also, there's a comment in the code:

// XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
// XXX in the paper, where page loses its score if it's distributed to
// XXX linked pages...

Is this something that will be looked at eventually or is the scoring
good enough at the moment without some adjustment.


I certainly hope that this is something that will be looked at
eventually. IMHO, the scoring is not good enough, but it doesn't
bother anyone enough for them to decide to fix it.

Also, see Andrzej's comments in NUTCH-267 about why the scoring-opic
plugin is not really OPIC. It is basically a glorified link counter.



Cheers,
Carl.




--
Doğacan Güney


Re: NUTCH-119 :: how hard to fix

2007-06-28 Thread Doğacan Güney

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

wow, setting db.max.outlinks.per.page immediately fixed my problem.  It looks 
like I totally mis-diagnosed things.

May I pose two questions:
1) how did you view all the outlinks?


bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser local_file


2) how severe is NUTCH-119 - does it occur on a lot of sites?


AFAIK, HtmlParser doesn't extract URLs with regexps. Nutch only uses a
regexp to extract outlinks from files that have no markup information
(such as plain text). See OutlinkExtractor.java.
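
For illustration, the idea behind that regexp-based extraction in a
self-contained form (the pattern here is simplified and hypothetical,
not the one OutlinkExtractor actually uses):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexOutlinkDemo {
  public static void main(String[] args) {
    String text = "See http://example.com/a and https://example.org/b for details.";
    // simplified URL pattern; the real one in OutlinkExtractor is more involved
    Matcher m = Pattern.compile("https?://\\S+").matcher(text);
    while (m.find()) {
      System.out.println(m.group()); // prints each URL found in the plain text
    }
  }
}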





- Original Message 
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
 I am evaluating nutch+lucene as a crawl and search solution.

 However, I am finding major bugs in nutch right off the bat.

 In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
discussion of it here:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

 Most of the links off www.variety.com, one of my main test sites, have 
relative URLs.  It seems incredible that nutch, which is capable of mapreduce, 
cannot fetch these URLs.

 It could be that I would fix this bug if, for other reasons, I decide to go 
with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  Or 
are the developers, who are just volunteers anyway, more interested in fixing 
other problems?

 Could someone outline the issue for me a bit more clearly so I would know how 
to evaluate it?

Both this one and the other site you were mentioning (sf911truth)
have more than 100 outlinks. By default, Nutch only stores 100
outlinks per page (db.max.outlinks.per.page). The about.html link
happens to be the 105th link or so, so Nutch doesn't store it. All you
have to do is either increase db.max.outlinks.per.page or set it to -1
(which means store all outlinks).





   

 Park yourself in front of a world of choices in alternative vehicles. Visit 
the Yahoo! Auto Green Center.
 http://autos.yahoo.com/green_center/


--
Doğacan Güney









Be a better Heartthrob. Get better relationship answers from someone who knows. 
Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=listsid=396545433



--
Doğacan Güney


Re: [jira] Commented: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-06-28 Thread Doğacan Güney

On 6/28/07, Hudson (JIRA) [EMAIL PROTECTED] wrote:


[ 
https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508747
 ]

Hudson commented on NUTCH-474:
--

Integrated in Nutch-Nightly #131 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])


*sigh*

I wrote NUTCH-474 instead of NUTCH-434 in svn log. Sorry everyone...



 Fetcher2 sets server-delay and blocking checks incorrectly
 --

 Key: NUTCH-474
 URL: https://issues.apache.org/jira/browse/NUTCH-474
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Andrzej Bialecki
 Fix For: 1.0.0

 Attachments: fetcher2.patch


 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay 
if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the 
opposite.
 2) Fetcher2 sets wrong configuration options so host blocking is still 
handled by the lib-http plugin (Fetcher2 is designed to handle blocking 
internally).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





--
Doğacan Güney


JIRA email question

2007-06-27 Thread Doğacan Güney

Hi list,

There is this sentence at the end of every JIRA message:

You can reply to this email to add a comment to the issue online.

But replying to a JIRA message through nutch-dev doesn't add it as a
comment. So you have to either reply through JIRA (in which case it
looks like you are responding to an imaginary person :) or through
email (in which case part of the discussion doesn't get documented in
JIRA). Why doesn't this work?

--
Doğacan Güney


Re: NUTCH-119 :: how hard to fix

2007-06-26 Thread Doğacan Güney

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

I am evaluating nutch+lucene as a crawl and search solution.

However, I am finding major bugs in nutch right off the bat.

In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
discussion of it here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

Most of the links off www.variety.com, one of my main test sites, have relative 
URLs.  It seems incredible that nutch, which is capable of mapreduce, cannot 
fetch these URLs.

It could be that I would fix this bug if, for other reasons, I decide to go 
with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable?  
Or are the developers, who are just volunteers anyway, more interested in 
fixing other problems?

Could someone outline the issue for me a bit more clearly so I would know how 
to evaluate it?


Both this one and the other site you were mentioning (sf911truth)
have more than 100 outlinks. By default, Nutch only stores 100
outlinks per page (db.max.outlinks.per.page). The about.html link
happens to be the 105th link or so, so Nutch doesn't store it. All you
have to do is either increase db.max.outlinks.per.page or set it to -1
(which means store all outlinks).






  

Park yourself in front of a world of choices in alternative vehicles. Visit the 
Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/



--
Doğacan Güney


Re: Found the bug in Generator when number of URLs is small

2007-06-21 Thread Doğacan Güney

On 6/21/07, Vishal Shah [EMAIL PROTECTED] wrote:

Hi,

   I think I found the reason why the generator returns with an empty
fetchlist for small fetchsizes.

   After the first job finishes running, the generator checks the following
condition to see if it got an empty list:

if (readers == null || readers.length == 0 || !readers[0].next(new
FloatWritable())) {

  The third condition is incorrect here. In some cases, esp. for small
fetchlists, the first partition might be empty, but some other partition(s)
might contain urls. In this case, the Generator is incorrectly assuming that
all partitions are empty by just looking at the first. This problem could
also occur when all URLs in the fetchlist are from the same host (or from a
very small number of hosts, or from a number of hosts that all map to a
small number of partitions).

  I fixed this problem by replacing the following code:

// check that we selected at least some entries ...
SequenceFile.Reader[] readers =
    SequenceFileOutputFormat.getReaders(job, tempDir);
if (readers == null || readers.length == 0
    || !readers[0].next(new FloatWritable())) {
  LOG.warn("Generator: 0 records selected for fetching, exiting ...");
  LockUtil.removeLockFile(fs, lock);
  fs.delete(tempDir);
  return null;
}

With the following code:

// check that we selected at least some entries ...
SequenceFile.Reader[] readers =
    SequenceFileOutputFormat.getReaders(job, tempDir);
boolean empty = true;
if (readers != null && readers.length > 0) {
  for (int num = 0; num < readers.length; num++) {
    if (readers[num].next(new FloatWritable())) {
      empty = false;
      break;
    }
  }
}
if (empty) {
  LOG.warn("Generator: 0 records selected for fetching, exiting ...");
  LockUtil.removeLockFile(fs, lock);
  fs.delete(tempDir);
  return null;
}

This seems to do the trick.


Nice catch. Can you open a JIRA issue and attach a patch there?



Regards,

-vishal.




--
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Doğacan Güney
Publishing Javadoc
Recording test results




This is rather strange. Here is part of the console output:

test:
[echo] Testing plugin: parse-swf
   [junit] Running org.apache.nutch.parse.swf.TestSWFParser
   [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec
   [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec

init:
   [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED


SWFParser fails one of the unit tests but the report says that
FeedParser has failed even though it has actually passed its test:

test:
[echo] Testing plugin: feed
   [junit] Running org.apache.nutch.parse.feed.TestFeedParser
   [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec


--
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Doğacan Güney

On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote:


This is rather strange. Here is part of the console output:

test:
 [echo] Testing plugin: parse-swf
[junit] Running org.apache.nutch.parse.swf.TestSWFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec

init:
[junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED


SWFParser fails one of the unit tests but the report says that
FeedParser has failed even though it has actually passed its test:

test:
 [echo] Testing plugin: feed
[junit] Running org.apache.nutch.parse.feed.TestFeedParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec




(ant test forks processes to run the tests; that's why we are seeing
test outputs out of order.)

Anyway, it is not TestSWFParser but TestFeedParser that fails. I am
trying to understand why it fails. Chris, can you lend me a hand here?

--
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Doğacan Güney

On 6/20/07, Chris Mattmann [EMAIL PROTECTED] wrote:

Doğacan,

 This is strange indeed. I noticed this during my testing of parse-feed,
however, thought it was an anomaly. I got this same strange cryptic unit
test error message, and then after some frustration figuring it out, I did
ant clean, then ant compile-core test, and miraculously the error seemed to
go away. Also, if you go into $NUTCH/src/plugin/feed/ and run ant clean test
(of course after running ant compile-core from the top-level $NUTCH dir),
the unit tests seem to pass?

[XXX:src/plugin/feed] mattmann% pwd
/Users/mattmann/src/nutch/src/plugin/feed
[XXX:src/plugin/feed] mattmann% ant clean test
Searching for build.xml ...
Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml

clean:
   [delete] Deleting directory /Users/mattmann/src/nutch/build/feed
   [delete] Deleting directory /Users/mattmann/src/nutch/build/plugins/feed

init:
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: feed
[javac] Compiling 2 source files to
/Users/mattmann/src/nutch/build/feed/classes

compile-test:
[javac] Compiling 1 source file to
/Users/mattmann/src/nutch/build/feed/test

jar:
  [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar

deps-test:

init:

init-plugin:

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: protocol-file

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
[mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed

copy-generated-lib:
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed
 [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed

test:
 [echo] Testing plugin: feed
[junit] Running org.apache.nutch.parse.feed.TestFeedParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec

BUILD SUCCESSFUL
Total time: 3 seconds
[XXX:src/plugin/feed] mattmann%

Any ideas?


It never passes for me (not even when I do it in src/plugin/feed). If
you check the output, parseResult only contains a single entry which
is rsstest.rss.

I think what causes this bug is (surprise, surprise) PrefixURLFilter.
We don't have a template for prefix-urlfilter.txt in conf, so it
doesn't get properly initialized and (I can't figure out why but)
randomly filters out stuff.

When I put a sample prefix-urlfilter.txt(*) under conf, all tests seem to pass.

(*) As your friendly neighborhood Nutch developer, I even put up a
sample file at:

http://www.ceng.metu.edu.tr/~e1345172/prefix-urlfilter.txt
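
For the record, the file format is simple: one URL prefix per line, and '#'
starts a comment. A minimal template (my rough guess, not necessarily the
exact file above) looks like:

# prefix-urlfilter.txt: only URLs starting with one of these prefixes pass
http://
https://
ftp://
file://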



Cheers,
  Chris




On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote:

 On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote:

 This is rather
 strange. Here is part of the console output:

 test:
  [echo] Testing
 plugin: parse-swf
 [junit] Running
 org.apache.nutch.parse.swf.TestSWFParser
 [junit] Tests run: 1, Failures:
 0, Errors: 0, Time elapsed: 2.315 sec
 [junit] Tests run: 1, Failures: 1,
 Errors: 0, Time elapsed: 5.387 sec

 init:
 [junit] Test
 org.apache.nutch.parse.feed.TestFeedParser FAILED


 SWFParser fails one of
 the unit tests but the report says that
 FeedParser has failed even though it
 has actually passed its test:

 test:
  [echo] Testing plugin: feed

 [junit] Running org.apache.nutch.parse.feed.TestFeedParser
 [junit] Tests
 run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec



(ant test forks
 processes to test code, that's why we are seeing test
outputs out of
 order.)

Anyway, it is not TestSWFParser but TestFeedParser that fails. I
 am
trying to understand why it fails. Chris, can you lend me a hand here?

--

Doğacan Güney


__
Chris A. Mattmann
[EMAIL PROTECTED]
Key Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.






--
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Doğacan Güney

On 6/20/07, Dennis Kubes [EMAIL PROTECTED] wrote:

Is this the same Java 6 error that was popping up a while back? For
some reason with Java 6 the XML is being parsed differently in the SWF
parser, and therefore unit tests looking for exact strings were failing.
Could this be happening in the feed parser as well?


I ran into some other issues with Java 6 (backward compatibility,
right...), so I actually switched my Java back to 5, at least for this
computer.



Dennis Kubes

Chris Mattmann wrote:
 Doğacan,

  This is strange indeed. I noticed this during my testing of parse-feed,
 however, thought it was an anomaly. I got this same strange cryptic unit
 test error message, and then after some frustration figuring it out, I did
 ant clean, then ant compile-core test, and miraculously the error seemed to
 go away. Also, if you go into $NUTCH/src/plugin/feed/ and run ant clean test
 (of course after running ant compile-core from the top-level $NUTCH dir),
 the unit tests seem to pass?

 [XXX:src/plugin/feed] mattmann% pwd
 /Users/mattmann/src/nutch/src/plugin/feed
 [XXX:src/plugin/feed] mattmann% ant clean test
 Searching for build.xml ...
 Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml

 clean:
[delete] Deleting directory /Users/mattmann/src/nutch/build/feed
[delete] Deleting directory /Users/mattmann/src/nutch/build/plugins/feed

 init:
 [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed
 [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes
 [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test
 [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data
  [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data

 init-plugin:

 deps-jar:

 compile:
  [echo] Compiling plugin: feed
 [javac] Compiling 2 source files to
 /Users/mattmann/src/nutch/build/feed/classes

 compile-test:
 [javac] Compiling 1 source file to
 /Users/mattmann/src/nutch/build/feed/test

 jar:
   [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar

 deps-test:

 init:

 init-plugin:

 compile:

 jar:

 deps-test:

 deploy:

 copy-generated-lib:

 init:

 init-plugin:

 deps-jar:

 compile:
  [echo] Compiling plugin: protocol-file

 jar:

 deps-test:

 deploy:

 copy-generated-lib:

 deploy:
 [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed
  [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed

 copy-generated-lib:
  [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed
  [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed

 test:
  [echo] Testing plugin: feed
 [junit] Running org.apache.nutch.parse.feed.TestFeedParser
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec

 BUILD SUCCESSFUL
 Total time: 3 seconds
 [XXX:src/plugin/feed] mattmann%

 Any ideas?

 Cheers,
   Chris




 On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote:

 On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote:

 This is rather
 strange. Here is part of the console output:

 test:
  [echo] Testing
 plugin: parse-swf
 [junit] Running
 org.apache.nutch.parse.swf.TestSWFParser
 [junit] Tests run: 1, Failures:
 0, Errors: 0, Time elapsed: 2.315 sec
 [junit] Tests run: 1, Failures: 1,
 Errors: 0, Time elapsed: 5.387 sec

 init:
 [junit] Test
 org.apache.nutch.parse.feed.TestFeedParser FAILED


 SWFParser fails one of
 the unit tests but the report says that
 FeedParser has failed even though it
 has actually passed its test:

 test:
  [echo] Testing plugin: feed

 [junit] Running org.apache.nutch.parse.feed.TestFeedParser
 [junit] Tests
 run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec



 (ant test forks
 processes to test code, that's why we are seeing test
 outputs out of
 order.)

 Anyway, it is not TestSWFParser but TestFeedParser that fails. I
 am
 trying to understand why it fails. Chris, can you lend me a hand here?

 --
 Doğacan Güney


 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Key Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group

 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 ___

 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.






--
Doğacan Güney


Re: Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Doğacan Güney

On 6/20/07, Chris Mattmann [EMAIL PROTECTED] wrote:

On 6/20/07 7:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote:

 It never passes for me (not even when I do it in src/plugin/feed). If
you
 check the output, parseResult only contains a single entry which
is
 rsstest.rss.

Okay, please tell me I'm  not crazy here. I'm on Mac OS X 10.4, Java
version:

# java -version
java version 1.5.0_07
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)

I did a fresh checkout of the Nutch trunk. Then, from that dir, I run:

# ant compile-core
# cd src/plugin/feed
# ant clean test

All tests pass? Here is a log:

[XXX:~/src/nutch] mattmann% ant compile-core
Searching for build.xml ...
Buildfile: /Users/mattmann/src/nutch/build.xml

init:
[mkdir] Created dir: /Users/mattmann/src/nutch/build
[mkdir] Created dir: /Users/mattmann/src/nutch/build/classes
[mkdir] Created dir: /Users/mattmann/src/nutch/build/test
[mkdir] Created dir: /Users/mattmann/src/nutch/build/test/classes
[mkdir] Created dir: /Users/mattmann/src/nutch/build/hadoop
[unjar] Expanding: /Users/mattmann/src/nutch/lib/hadoop-0.12.2-core.jar
into /Users/mattmann/src/nutch/build/hadoop
[untar] Expanding: /Users/mattmann/src/nutch/build/hadoop/bin.tgz into
/Users/mattmann/src/nutch/bin
[mkdir] Created dir: /Users/mattmann/src/nutch/build/webapps
[unjar] Expanding: /Users/mattmann/src/nutch/lib/hadoop-0.12.2-core.jar
into /Users/mattmann/src/nutch/build

compile-core:
[javac] Compiling 172 source files to
/Users/mattmann/src/nutch/build/classes
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

BUILD SUCCESSFUL
Total time: 3 seconds
[XXX:~/src/nutch] mattmann% cd src/plugin/feed
[XXX:src/plugin/feed] mattmann% ant clean test
Searching for build.xml ...
Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data

clean:
   [delete] Deleting directory /Users/mattmann/src/nutch/build/feed

init:
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test
[mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: feed
[javac] Compiling 2 source files to
/Users/mattmann/src/nutch/build/feed/classes

compile-test:
[javac] Compiling 1 source file to
/Users/mattmann/src/nutch/build/feed/test

jar:
  [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar

deps-test:

init:
[mkdir] Created dir:
/Users/mattmann/src/nutch/build/nutch-extensionpoints
[mkdir] Created dir:
/Users/mattmann/src/nutch/build/nutch-extensionpoints/classes
[mkdir] Created dir:
/Users/mattmann/src/nutch/build/nutch-extensionpoints/test

init-plugin:

compile:

jar:
  [jar] Building MANIFEST-only jar:
/Users/mattmann/src/nutch/build/nutch-extensionpoints/nutch-extensionpoints.
jar

deps-test:

deploy:
[mkdir] Created dir:
/Users/mattmann/src/nutch/build/plugins/nutch-extensionpoints
 [copy] Copying 1 file to
/Users/mattmann/src/nutch/build/plugins/nutch-extensionpoints

copy-generated-lib:
 [copy] Copying 1 file to
/Users/mattmann/src/nutch/build/plugins/nutch-extensionpoints

init:
[mkdir] Created dir: /Users/mattmann/src/nutch/build/protocol-file
[mkdir] Created dir:
/Users/mattmann/src/nutch/build/protocol-file/classes
[mkdir] Created dir: /Users/mattmann/src/nutch/build/protocol-file/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: protocol-file
[javac] Compiling 4 source files to
/Users/mattmann/src/nutch/build/protocol-file/classes

jar:
  [jar] Building jar:
/Users/mattmann/src/nutch/build/protocol-file/protocol-file.jar

deps-test:

deploy:
[mkdir] Created dir:
/Users/mattmann/src/nutch/build/plugins/protocol-file
 [copy] Copying 1 file to
/Users/mattmann/src/nutch/build/plugins/protocol-file

copy-generated-lib:
 [copy] Copying 1 file to
/Users/mattmann/src/nutch/build/plugins/protocol-file

deploy:
[mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed

copy-generated-lib:
 [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed
 [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed

test:
 [echo] Testing plugin: feed
[junit] Running org.apache.nutch.parse.feed.TestFeedParser

upgrade to hadoop-0.13?

2007-06-18 Thread Doğacan Güney

Hi all,

As you know, hadoop-0.13 was recently released and it brings some
impressive improvements over the hadoop-0.12.x series. So the obvious
question is: should we switch to hadoop-0.13?

I have tested nutch with hadoop-0.13 with all basic jobs (inject,
generate, fetch, parse, updatedb, invertlinks, index, dedup) and they
work fine.

--
Doğacan Güney


Re: Welcome Doğacan as Nutch committer

2007-06-12 Thread Doğacan Güney

Hi all,

Thank you everyone! It has been very exciting so far and I believe
that it is only going to get better from here on :)

Let me introduce myself (very) shortly: I am based in Ankara, Turkey.
I am 22 and currently working on my graduate degree.

I hope that together we will make nutch rock even harder.

--
Doğacan Güney


Re: [Fwd: Nutch 0.9 and Crawl-Delay]

2007-06-05 Thread Doğacan Güney

Hi,

On 6/4/07, Doug Cutting [EMAIL PROTECTED] wrote:

Does the 0.9 crawl-delay implementation actually permit multiple threads
to access a site simultaneously?


AFAIK, yes. Option fetcher.threads.per.host should be greater than 1
_only_ when you are accessing a site under your control. So, all of
nutch's politeness policies are pretty much ignored when
fetcher.threads.per.host is greater than 1.

Fetcher2 completely ignores Nutch's server-delay and the site's
crawl-delay value if maxThreads > 1, and uses another min.crawl.delay
value when accessing the site.

I am not sure about Fetcher, but I think it is going to allow up to
maxThreads fetcher threads to access the site simultaneously and then
block the next one.

There may be a better explanation in this post to nutch-dev:
"Fetcher2's delay between successive requests".
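
For reference, a minimal nutch-site.xml override that keeps the polite
behaviour would look something like this (property names as I recall them
from nutch-default.xml, values only illustrative):

<configuration>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- keep this at 1 unless the site is under your control -->
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds to wait between successive requests to the same host -->
    <value>5.0</value>
  </property>
</configuration>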




Doug

 Original Message 
Subject: Nutch 0.9 and Crawl-Delay
Date: Sun, 3 Jun 2007 10:50:24 +0200
From: Lutz Zetzsche [EMAIL PROTECTED]
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Dear Nutch developers,

I have had problems with a Nutch based robot during the last 12 hours,
which I have now solved by banning this particular bot from my server
(not Nutch completely for the moment). The ilial bot, which created
considerable load on my server, was using the latest Nutch version -
v0.9 - which is now also supporting the crawl-delay directive in the
robots.txt.

The bot seems to have obeyed the directive - crawl-delay: 10 - as it
visited my website every 15 seconds, which would have been ok, BUT it
then submitted FIVE requests at once (see example log extract below)! 5
requests at once every 15 seconds is not acceptable on my server, which
is principally serving dynamic content and is often visited by up to 10
search engines at the same time, altogether surely creating 99.9% of
the server traffic.

So my suggestion is that Nutch only submits one request each time, when
it detects a crawl-delay directive in the robots.txt. This is the
behaviour, the MSNbot shows for example. The MSNbot also liked to
submit several requests at once every few seconds, until I added the
crawl-delay directive to my robots.txt.


Best wishes

Lutz Zetzsche
http://www.sea-rescue.de/



72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /english/Photos+%26+Videos/PV/ HTTP/1.0 200
13661 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /english/Links/WRGL/Countries/ HTTP/1.0 200
15048 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/
HTTP/1.0 200 60041 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles
based Internet startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
66.249.72.244 - - [03/Jun/2007:04:40:55
+0200] GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/
HTTP/1.1 200 17568 - Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)
66.231.189.119 - - [03/Jun/2007:04:40:55
+0200] GET
/english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/

HTTP/1.0 200 17193 - Gigabot/2.0
(http://www.gigablast.com/spider.html)
74.6.86.105 - - [03/Jun/2007:04:40:56
+0200] GET /dansk/Links/Hermann+Apelt/ HTTP/1.0 200
30496 - Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/
HTTP/1.0 200 16658 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles
based Internet startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0
200 15624 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based
Internet startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])




--
Doğacan Güney


Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Doğacan Güney

On 6/4/07, Sami Siren [EMAIL PROTECTED] wrote:

Briggs wrote:
 Yeah, you are correct there.  How does this thing actually even
 remotely begin to work on a  predictable level?

One crucial aspect of language identification is that the input is properly
encoded. There was a patch that added icu4j character set encoding
detection into Nutch. I believe icu4j also offers language
identification in addition to character set detection. Has anyone
checked how usable the language identification from icu4j would be?

There are severe problems with the current language identification for CJK,
for example.



Can you give a few links? I have looked at icu4j's API, but I haven't
found any info about language identification.

IBM does have something called Linguini
(http://www-306.ibm.com/software/globalization/topics/linguini/index.jsp)
. It doesn't seem to be open source, though.



--
 Sami Siren




--
Doğacan Güney


Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

2007-06-05 Thread Doğacan Güney

On 6/5/07, Sami Siren [EMAIL PROTECTED] wrote:


I just saw this on api and assumed it had to do with detecting the
language, I might be wrong:

http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()


I think that method is used to get the detected charset's ISO language code.
For example, it returns "tr" for ISO-8859-9.

That being said, language identification is a very crucial feature and
if it doesn't work properly, well, someone should do something about
it :).
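
If anyone wants to check quickly, this is roughly how that API is used (a
sketch assuming icu4j on the classpath; what actually gets detected depends
on the input bytes):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class CharsetMatchDemo {
  public static void main(String[] args) throws Exception {
    byte[] content = "Dünya çapında örümcek ağı".getBytes("ISO-8859-9");
    CharsetDetector detector = new CharsetDetector();
    detector.setText(content);
    CharsetMatch match = detector.detect();
    // getName() is the detected charset; getLanguage() is only the ISO 639 code
    // implied by that charset guess, not real language identification.
    System.out.println(match.getName() + " / " + match.getLanguage()
        + " (confidence " + match.getConfidence() + ")");
  }
}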




--
 Sami Siren




--
Doğacan Güney


Re: Plugins and Thread Safety

2007-06-04 Thread Doğacan Güney

Hi,

On 6/4/07, Briggs [EMAIL PROTECTED] wrote:

So, I synchronized it and it seems that the problem has not repeated
itself.  I think that was it.


That's great. Can you open a JIRA issue and submit a patch for this?



Thanks


On 6/1/07, Briggs [EMAIL PROTECTED] wrote:

 I will get back to you.  It isn't the easiest bug to test.  So, will
 let you know soon!

 On 6/1/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
  Briggs wrote:
   Oh, you want me to change the getSorted method to be synchronized?
   I'll put a lock in there and see what happens, if that is what you are
   referring to.
 
  Yes, please try this change.
 
 
  --
  Best regards,
  Andrzej Bialecki 
___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 
 


 --
 Conscious decisions by conscious minds are what make reality real




--
Conscious decisions by conscious minds are what make reality real




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-31 Thread Doğacan Güney

On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:

On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Doğacan Güney wrote:

  My patch is just a draft to see if we can create a better caching
  mechanism. There are definitely some rough edges there:)

 One important information: in future versions of Hadoop the method
 Configuration.setObject() is deprecated and then will be removed, so we
 have to grow our own caching mechanism anyway - either use a singleton
 cache, or change nearly all API-s to pass around a user/job/task context.

 So, we will face this problem pretty soon, with the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.




  You are right about per-plugin parameters but I think it will be very
  difficult to keep PluginProperty class in sync with plugin parameters.
  I mean, if a plugin defines a new parameter, we have to remember to
  update PluginProperty. Perhaps, we can force plugins to define
  configuration options it will use in, say, its plugin.xml file, but
  that will be very error-prone too. I don't want to compare entire
  configuration objects, because changing irrelevant options, like
  fetcher.store.content shouldn't force loading plugins again, though it
  seems it may be inevitable

 Let me see if I understand this ... In my opinion this is a non-issue.

 Child tasks are started in separate JVMs, so the only context
 information that they have is what they can read from job.xml (which is
 a superset of all properties from config files + job-specific data +
 task-specific data). This context is currently instantiated as a
 Configuration object, and we (ab)use it also as a local per-JVM cache
 for plugin instances and other objects.

 Once we instantiate the plugins, they exist unchanged throughout the
 lifecycle of JVM (== lifecycle of a single task), so we don't have to
 worry about having different sets of plugins with different parameters
 for different jobs (or even tasks).

 In other words, it seems to me that there is no such situation in which
 we have to reload plugins within the same JVM, but with different
 parameters.

Problem is that someone might get a little too smart. Like one may
write a new job where he has two IndexingFilters but creates each from
completely different configuration objects. Then filters some
documents with the first filter and others with the second. I agree
that this is a bit of a reach, but it is possible.


Actually thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get() where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository
but PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolas') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository-s again and
again...

What do you think?
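
To make that concrete, here is a rough sketch of the kind of keyed cache we
are talking about (a hypothetical class, not the actual patch; I am assuming
the key is built only from the plugin-related properties such as
plugin.folders, plugin.includes and plugin.excludes):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

public class PluginRepositorySketch {
  // One repository per distinct set of plugin-related properties.
  private static final Map<String, PluginRepositorySketch> CACHE =
      new HashMap<String, PluginRepositorySketch>();

  public static synchronized PluginRepositorySketch get(Configuration conf) {
    String key = conf.get("plugin.folders") + "|"
        + conf.get("plugin.includes") + "|"
        + conf.get("plugin.excludes");
    PluginRepositorySketch repo = CACHE.get(key);
    if (repo == null) {
      repo = new PluginRepositorySketch(conf);
      CACHE.put(key, repo);
    }
    return repo;
  }

  private PluginRepositorySketch(Configuration conf) {
    // plugin discovery and classloading would happen here
  }
}

A real version still has to pick an eviction policy (WeakHashMap,
SoftReferences or a small LRU, as raised elsewhere in the thread), since a
plain HashMap keeps every repository alive for the life of the JVM.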





 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




--
Doğacan Güney




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney

Hi,

On 5/29/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:


 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

Some comments about your patch. The approach seems nice: you only check
the parameters that affect plugin loading. But keep in mind that the
plugins themselves will configure themselves with many other parameters,
so to keep things safe there should be a PluginRepository for each set
of parameters (including all of them). Besides, remember that CACHE is a
WeakHashMap and you are creating ad-hoc PluginProperty objects as keys;
something doesn't look right... the lifespan of those objects will be
much shorter than you require. Perhaps you should be using
SoftReferences instead, or a simple LRU cache (LinkedHashMap provides that
easily).


My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)

I don't really worry about WeakHashMap-LinkedHashMap stuff. But your
approach is simple and should be faster so I guess it's OK.
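
For reference, the LRU variant Nicolás mentions takes very little code; a
generic sketch, nothing Nutch-specific:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  public LruCache(int capacity) {
    // 'true' switches iteration to access order, which makes eviction LRU
    super(16, 0.75f, true);
    this.capacity = capacity;
  }

  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // drop the least recently used entry once we grow past capacity
    return size() > capacity;
  }
}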

You are right about per-plugin parameters, but I think it will be very
difficult to keep the PluginProperty class in sync with plugin parameters.
I mean, if a plugin defines a new parameter, we have to remember to
update PluginProperty. Perhaps we can force plugins to define the
configuration options they will use in, say, their plugin.xml file, but
that will be very error-prone too. I don't want to compare entire
configuration objects, because changing irrelevant options like
fetcher.store.content shouldn't force loading plugins again, though it
seems it may be inevitable.



Anyway, I'll try to build my own Nutch to test your patch.

Thanks!





--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-30 Thread Doğacan Güney

On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Doğacan Güney wrote:

 My patch is just a draft to see if we can create a better caching
 mechanism. There are definitely some rough edges there:)

One important information: in future versions of Hadoop the method
Configuration.setObject() is deprecated and then will be removed, so we
have to grow our own caching mechanism anyway - either use a singleton
cache, or change nearly all API-s to pass around a user/job/task context.

So, we will face this problem pretty soon, with the next upgrade of Hadoop.


Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.





 You are right about per-plugin parameters but I think it will be very
 difficult to keep PluginProperty class in sync with plugin parameters.
 I mean, if a plugin defines a new parameter, we have to remember to
 update PluginProperty. Perhaps, we can force plugins to define
 configuration options it will use in, say, its plugin.xml file, but
 that will be very error-prone too. I don't want to compare entire
  configuration objects, because changing irrelevant options, like
 fetcher.store.content shouldn't force loading plugins again, though it
 seems it may be inevitable

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only context
information that they have is what they can read from job.xml (which is
a superset of all properties from config files + job-specific data +
task-specific data). This context is currently instantiated as a
Configuration object, and we (ab)use it also as a local per-JVM cache
for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the
lifecycle of JVM (== lifecycle of a single task), so we don't have to
worry about having different sets of plugins with different parameters
for different jobs (or even tasks).

In other words, it seems to me that there is no such situation in which
we have to reload plugins within the same JVM, but with different
parameters.


Problem is that someone might get a little too smart. Like one may
write a new job where he has two IndexingFilters but creates each from
completely different configuration objects. Then filters some
documents with the first filter and others with the second. I agree
that this is a bit of a reach, but it is possible.




--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney

Hi,

On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been looking at the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object does not have an equals or hashCode method...
wouldn't it be nice to add such a method (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah.. so many questions... =) )


Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch



Bye!




--
Doğacan Güney


Re: Plugins initialized all the time!

2007-05-29 Thread Doğacan Güney

On 5/29/07, Briggs [EMAIL PROTECTED] wrote:

I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and the never get unloaded
(they are loaded within their own classloader). So, you'll see the
same class loaded thousands of time, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters so I can deal with singletons.

You'll find the heart of the problem somewhere in the extension point
class(es). It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something, so this can be nasty.
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.


Well then can you test the patch too? Nicolas's idea seems to be the
right one. After this patch, I think plugin loaders will see the same
PluginRepository instance.







On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 Hi,

 On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
  I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
  that the plugin repository initializes itself all the time until I get
  an out of memory exception. I've been looking at the code... the plugin
  repository maintains a map from Configuration to plugin repositories, but
  the Configuration object does not have an equals or hashCode method...
  wouldn't it be nice to add such a method (comparing property values)?
  Wouldn't that help prevent initializing many plugin repositories? What
  could be the cause of my problem? (Aaah.. so many questions... =) )

 Which job causes the problem? Perhaps, we can find out what keeps
 creating a conf object over and over.

 Also, I have tried what you have suggested (better caching for plugin
 repository) and it really seems to make a difference. Can you try with
 this patch(*) to see if it solves your problem?

 (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

 
  Bye!
 


 --
 Doğacan Güney



--
Conscious decisions by conscious minds are what make reality real




--
Doğacan Güney


Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Doğacan Güney

Hi,

On 5/21/07, Marcin Okraszewski [EMAIL PROTECTED] wrote:

Hi,
The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9:
HtmlParser.java:248-259). The problem is that the first feature being set
throws an exception, so the whole setup block is skipped. The catch statement
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk 
contains the same code.

The patch does the following:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar
at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log) because
the report-errors feature is being set. Shouldn't it be removed?
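
For illustration, the per-feature setup the patch describes looks roughly
like this (a sketch only, using commons-logging as elsewhere in Nutch; the
feature ids shown are the usual NekoHTML ones, not necessarily the full list
HtmlParser sets):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.cyberneko.html.parsers.DOMFragmentParser;
import org.xml.sax.SAXException;

public class NekoSetupSketch {
  private static final Log LOG = LogFactory.getLog(NekoSetupSketch.class);

  public static DOMFragmentParser newParser() {
    DOMFragmentParser parser = new DOMFragmentParser();
    // Each feature gets its own try/catch, so one failure no longer
    // silently skips the rest of the setup.
    setFeature(parser, "http://cyberneko.org/html/features/augmentations", true);
    setFeature(parser, "http://cyberneko.org/html/features/report-errors", false);
    return parser;
  }

  private static void setFeature(DOMFragmentParser parser, String id, boolean value) {
    try {
      parser.setFeature(id, value);
    } catch (SAXException e) {
      LOG.warn("Could not set Neko feature " + id + ": " + e);
    }
  }
}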


I would suggest that you open a JIRA issue and attach the patch there.
For this case, there is a similar issue (with a patch) at NUTCH-369.



Cheers,
Marcin




--
Doğacan Güney


Re: retrieving original html from database

2007-04-25 Thread Doğacan Güney

On 4/25/07, Charlie Williams [EMAIL PROTECTED] wrote:

I have an index of pages from the web, a bit over 1 million. The fetch took
several weeks to complete, since it was mainly over a small set of domains.
Once we had a completed fetch and index, we began trying to work with the
retrieved text, and found that the cached text is just that, flat text. Is
the original HTML cached anywhere that it can be accessed after the initial
fetch? It would be a shame to have to recrawl all those pages. We are using
Nutch > .8


If you have fetcher.store.content set to true, then Nutch has stored a
copy of all the pages in <segment_dir>/content. You can extract
individual contents with the command ./nutch readseg -get
<segment_dir> <url> -noparse -nofetch -nogenerate -noparsetext
-noparsedata.



Thanks for any help.

-Charlie




--
Doğacan Güney

