Re: [Nutchbase] WebPage class is a generated code?

2010-07-03 Thread Doğacan Güney
Hey,

On Fri, Jul 2, 2010 at 17:26, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

  Hey Guys,

 Since they are generated, +1 to:


- adding a filepattern to svn:ignore to ignore them
- updating build.xml to autogenerate



I created NUTCH-842 to track this problem.


  Cheers,
 Chris




 On 7/2/10 3:24 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:



 (This question is mostly to Dogacan & Enis, but I encourage anyone familiar
 with the code to join the threads with [Nutchbase] - the sooner the better
 ;) ).

 I'm looking at src/gora/webpage.avsc and WebPage.java & friends...
 presumably the java code was autogenerated from avsc using Gora? If so, we
 should put this autogeneration step in our build.xml. Or am I missing
 something?


 Correct. If we keep the generated Java classes in svn then we probably want
 to make this task optional, i.e. it would not be done as part of the build
 tasks, OR we can add it to the build but remove it from svn (or better, add it
 to svn:ignore or whatever-it-is-called).

 J.



 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




-- 
Doğacan Güney


[jira] Created: (NUTCH-842) AutoGenerate WebPage code

2010-07-03 Thread JIRA
AutoGenerate WebPage code
-

 Key: NUTCH-842
 URL: https://issues.apache.org/jira/browse/NUTCH-842
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0


This issue will track the addition of an ant task that will automatically 
generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
src/gora/webpage.avsc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Minimizing the number of stored fields for Solr

2010-07-03 Thread Doğacan Güney
Hey everyone,

This is not really a proposal, but rather something I have been wondering
about for a while, so I wanted to see what everyone is thinking.

Currently in our Solr backend, we have stored=true indexed=false fields and
stored=true indexed=true fields. The former class of fields is mostly used
for storing digest, caching information, etc. I suggest that we get rid of
all indexed=false fields and read all such data from the storage backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but the id field. As an
example, currently title is stored/indexed in Solr while text is only
indexed (and thus will need to be fetched from the storage backend). But for
the hbase backend, title and text are already stored close together (in the
same column family), so the performance hit of reading just text or reading
both will likely be the same. And removing storage from Solr may lead to
better caching of indexed fields and thus better performance.
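
To make the proposed flow concrete, here is a rough sketch (SolrJ on the
search side, Gora-style calls on the storage side; the solr/store/display
names are assumed context, not real API): Solr returns only the id field,
and everything needed for display is read from the storage backend.

// Hedged sketch: Solr stores only "id"; title/text come from storage.
SolrQuery q = new SolrQuery("text:nutch");
QueryResponse rsp = solr.query(q);
for (SolrDocument doc : rsp.getResults()) {
  String key = (String) doc.getFieldValue("id"); // reversed-URL key
  WebPage page = store.get(key);                 // one storage read per hit
  // title and text sit in the same column family in hbase, so reading
  // both should cost about the same as reading either one
  display(page.getTitle(), page.getText());
}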

What does everyone think?

-- 
Doğacan Güney


Nutchbase design doc

2010-07-03 Thread Doğacan Güney
Hello everyone,

I am attaching first draft of a complete nutchbase design document. There
are parts missing and parts not yet explained clearly but I would like to
get everyone's opinion on what they think so far. Please let me
know which parts are unclear, which parts make no sense etc, and I will
improve the draft.

-

Nutchbase
=

1) Rationale

* All your data in a central location (at least, nutch gives you the
illusion of a centralized storage)
* No more segment/crawldb/linkdb merges.
* No more missing data in a job. There are a lot of places where we copy
data from one structure to another just so that it is available in a later
job. For example, during parsing we don't have access to a URL's fetch
status. So we copy fetch status into content metadata. This will no longer
be necessary after nutchbase. When writing a job or a new plugin, the
programmer only needs to specify which fields she wants to read, and they
will be available to the plugin / job.
* A much simpler data model. If you want to update a small part in a single
record, now you have to write a MR job that reads the relevant directory,
changes the single record, removes the old directory and renames the new one.
With nutchbase, you can just update that record (see the sketch below).
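
To illustrate that last point, here is a minimal sketch of what an in-place
update could look like through Gora (the key, the setTitle accessor and the
DataStore wiring are assumptions in the style of Gora's generated beans,
not the final API):

// Assumed context: DataStore<String, WebPage> store, already initialized.
// The key below is illustrative; real keys are reversed URLs (section 2).
WebPage page = store.get("com.foo.bar:http/to/index.html");
page.setTitle(new Utf8("Updated title"));           // change one field in place
store.put("com.foo.bar:http/to/index.html", page);
store.flush();                                      // no MR job, no directory renames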

2) Design

As mentioned above, all data for a URL is stored in a WebPage object. This
object is accessed by a key that is the reverse form of a URL. For example,

http://bar.foo.com:8983/to/index.html?a=b becomes
com.foo.bar:http:8983/to/index.html?a=b

If URLs are stored lexicographically, this means that URLs from the same
domain and host are stored close together. This will hopefully make
developing statistics tools for hosts and domains easier.
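
A minimal sketch of this reversal (the real implementation lives in Nutch's
utility code; this version is only illustrative and skips edge cases such
as IP-address hosts):

import java.net.MalformedURLException;
import java.net.URL;

public static String reverseUrl(String urlString) throws MalformedURLException {
  URL url = new URL(urlString);
  String[] parts = url.getHost().split("\\.");
  StringBuilder sb = new StringBuilder();
  for (int i = parts.length - 1; i >= 0; i--) { // bar.foo.com -> com.foo.bar
    sb.append(parts[i]);
    if (i > 0) sb.append('.');
  }
  sb.append(':').append(url.getProtocol());
  if (url.getPort() != -1) sb.append(':').append(url.getPort());
  sb.append(url.getFile());                     // path + query string
  return sb.toString();
}

// reverseUrl("http://bar.foo.com:8983/to/index.html?a=b")
//   returns "com.foo.bar:http:8983/to/index.html?a=b"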

Writing a MapReduce job that uses Gora for storage does not take much
effort. There is a new class called StorageUtils that has a number of static
methods to make setting mappers/reducers/etc easier. Here is an example
(from GeneratorJob.java):

Job job = new NutchJob(getConf(), "generate: " + crawlId);
StorageUtils.initMapperJob(job, FIELDS, SelectorEntry.class, WebPage.class,
    GeneratorMapper.class, URLPartitioner.class);
StorageUtils.initReducerJob(job, GeneratorReducer.class);

An important argument is the second argument to #initMapperJob. This
specifies all the fields that this job will be reading. If plugins will run
during a job (for example, during ParserJob, several plugins will be
active), then before the job is run, those plugins must be initialized and
FieldPluggable#getFields must be called for all plugins to figure out which
fields they want to read. During the map or reduce phase, modifying the
WebPage object is as simple as using the built-in setters. All changes will
be persisted.
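
For example, the field-collection step before a job is submitted could look
roughly like this (FieldPluggable#getFields is from the text above; the
surrounding names such as activePlugins are assumptions):

// Union of the job's own fields and every active plugin's fields; only
// these columns are then read from the storage backend during the job.
Collection<WebPage.Field> fields = new HashSet<WebPage.Field>(FIELDS);
for (FieldPluggable plugin : activePlugins) {
  Collection<WebPage.Field> pluginFields = plugin.getFields();
  if (pluginFields != null) {
    fields.addAll(pluginFields);
  }
}
StorageUtils.initMapperJob(job, fields, SelectorEntry.class, WebPage.class,
    GeneratorMapper.class, URLPartitioner.class);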

Even though some of these objects are still there, most of the CrawlDatum,
Content, ParseData or similar objects are removed (or are slated to be
removed). In most cases, plugins will simply take (String key, WebPage page)
as arguments and modify WebPage object in-place.

3) Jobs

Nutchbase uses the concept of a marker to identify what has been processed
and what will be processed from now on. WebPage object contains a
map<String, String> called markers. For example, when GeneratorJob generates
a URL, it puts a unique string and a unique crawl id to this marker map.
Then FetcherJob only fetches a given URL if it contains this marker. At the
end of the crawl cycle, DbUpdaterJob clears all markers (except markers
placed by IndexerJob).
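
In code, the marker convention could look something like this (a sketch:
the putToMarkers/getFromMarkers/removeFromMarkers accessors follow the
style Gora generates for Avro map fields, and the "_gnmrk_" key is just an
illustrative name):

// GeneratorJob: mark the page as generated in this crawl cycle.
page.putToMarkers(new Utf8("_gnmrk_"), new Utf8(crawlId));

// FetcherJob: fetch only pages generated in the requested cycle.
Utf8 mark = page.getFromMarkers(new Utf8("_gnmrk_"));
if (mark == null || (crawlId != null && !crawlId.equals(mark.toString()))) {
  return; // not generated (in this cycle), skip
}

// DbUpdaterJob: clear the marker at the end of the cycle.
page.removeFromMarkers(new Utf8("_gnmrk_"));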

a) InjectorJob: This phase consists of two different jobs. The first job
reads from a text file (as before) and puts a special inject marker on each
URL. The second job goes over all URLs; if it finds a WebPage with the
inject marker but nothing else (so it is a new URL), the URL is injected
(and the marker is deleted). Otherwise, the marker is just deleted (since
the URL is already injected).

b) GeneratorJob: GeneratorJob is similar to what it was before, but it is
now a single job. During the map phase:

if FetchSchedule indicates given URL is to be fetched then
    Calculate generator score using scoring filters
    Output <SelectorEntry(URL, score), WebPage>

SelectorEntry is sorted according to given score so highest scoring entries
will be processed first during reduce.

Then URLPartitioner partitions according to choices specified in config
files (by host, by ip or by domain).

The reduce phase counts URLs according to topN and host/domain limits, and
marks URLs as long as the limits have not yet been reached.
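
Put together, the map phase could look roughly like this (a sketch only;
the FetchSchedule, TableUtil and scoring-filter method signatures are
assumptions):

public void map(String reversedUrl, WebPage page, Context context)
    throws IOException, InterruptedException {
  String url = TableUtil.unreverseUrl(reversedUrl);
  if (!schedule.shouldFetch(url, page, curTime)) {
    return; // FetchSchedule says this URL is not due yet
  }
  float score = scoringFilters.generatorSortValue(url, page, 1.0f);
  entry.set(url, score);      // SelectorEntry carries url + sort score
  context.write(entry, page); // sorted by score, partitioned by URLPartitioner
}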

GeneratorJob marks all generated URLs with a unique crawl id. Fetcher,
parser and indexer jobs can then use this crawl id to process only URLs that
are in that particular crawl cycle. Alternatively, they can also work on all
URLs (though, again, FetcherJob will only work on a URL which has been
marked by GeneratorJob. So even if FetcherJob is instructed to work on all
URLs, it will skip those without the marker).

Re: Minimizing the number of stored fields for Solr

2010-07-03 Thread Andrzej Bialecki

On 2010-07-03 10:00, Doğacan Güney wrote:

Hey everyone,

This is not really a proposal, but rather something I have been wondering
about for a while, so I wanted to see what everyone is thinking.

Currently in our Solr backend, we have stored=true indexed=false fields and
stored=true indexed=true fields. The former class of fields is mostly used
for storing digest, caching information, etc. I suggest that we get rid of
all indexed=false fields and read all such data from the storage backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but the id field. As an
example, currently title is stored/indexed in Solr while text is only
indexed (and thus will need to be fetched from the storage backend). But for
the hbase backend, title and text are already stored close together (in the
same column family), so the performance hit of reading just text or reading
both will likely be the same. And removing storage from Solr may lead to
better caching of indexed fields and thus better performance.

What does everyone think?



The issue is not as simple as it looks. If you want to have good
performance for searching & snippet generation then you still need to
store some data in stored fields - at least url, title, and plain text
(not to mention the option to use term vectors in order to speed up the
snippet generation). Solr functionality can also be impaired by a lack
of data available directly from Lucene storage (field cache, faceting,
term vector highlighting).


Some fields of course are not useful for display, but are used for 
searching only (e.g. anchors). These should be indexed but not stored in 
Solr. And it's ok to get them from non-solr storage if requested, 
because it's a rare event. The same goes for the full raw content, if 
you want to offer a cached view - this should not be stored in Solr 
but instead it should come from a separate layer (note that sometimes 
cached view might not be in the original format - pdf, office, etc - and 
instead an html representation may be more suitable, so in general the 
cached view shouldn't automatically equal the original raw content).


But for other fields I would argue that for now they should remain 
stored in Solr, *even the full text*, until we figure out how they 
affect the ability and performance of common search operations. E.g. if 
we remove the stored title field then we need to reach the storage
layer in order to display each page of results... not to mention issues 
like highlighting, faceting, function queries and a host of other 
functionalities that Solr can offer just because a field is stored in 
its index.


So I'm -0 to this proposal - of course we should review our schema, and 
of course we should have a mechanism to get data from the storage layer, 
but what you propose is IMHO a premature optimization at this point.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutchbase design doc

2010-07-03 Thread Doğacan Güney
Hi Alex,

On Sat, Jul 3, 2010 at 14:45, Alex McLintock alex.mclint...@gmail.com wrote:

 Doğacan


 2010/7/3 Doğacan Güney doga...@gmail.com:
  I am attaching first draft of a complete nutchbase design document. There
  are parts missing and parts not yet explained clearly but I would like to
  get everyone's opinion on what they think so far.

 Thanks. I read your design and found it quite clear - at least to this
 non-committer. :-)
 I would suggest that we should take this opportunity to do a full
 design document including design which has not changed from v1 to v2.
 So more please!

 I have made the odd comment which was then explained later on. I guess
 that meant I was a bit confused :-)


Thanks for the excellent comments. I will try to explain as best as I can.


   Please let me
  know which parts are unclear, which parts make no sense etc, and I will
  improve the draft.


 The main thing I missed was any kind of overview of the data flow. I'd
 like to see a description of how a url/webpage goes through our system
 from being unknown, injected or discovered, to being queued for
 fetching in the generator, to fetched, parsed, fetchscheduled, scored,
 and generated again.
 Plus of course, indexed by sending to solr and being seen by an end
 user application.

 At each stage I'd like to see where the data is stored (text file,
 hbase, solr) and especially how this differs from the previous (text
 file, nutch crawldb, solr)

 I know that some of this may sound like a tutorial, but it is worth
 doing now rather than putting it off until later.


One of the things nutchbase attempts to do is hide all the complexity of
managing individual segments,
crawl/link/whatever dbs from the user. Now, nutch delegates all storage
handling to Gora (http://github.com/enis/gora).
What Gora does is, it gives you a key-value store (in this case, keys are
reversed URLs, values are WebPage objects), and
you do all your work through these objects. So storage will not be an issue
for you. Right now, Gora (and thus nutch) supports
storing your data in hbase and sql (with cassandra and other backends coming
soon).

So with nutch and gora, you will start up your hbase/sql/cassandra/etc
server(s), then nutch will figure out what to store and where.




  
 -
  Nutchbase
  =
  1) Rationale
  * All your data in a central location (at least, nutch gives you the
  illusion of a centralized storage)

 But hbase is distributed across your hadoop cluster, right? This is
 the illusion you meant.


Yes. Also, cassandra will be distributed too. Maybe in the future someone
will
write a HDFS-backed backend to Gora, then your data will actually live in
separate
files, but will still look like one centralized storage to you.



  2) Design
  As mentioned above, all data for a URL is stored in a WebPage object.
 This
  object is accessed by a key that is the reverse form of a URL. For
 example,
  http://bar.foo.com:8983/to/index.html?a=b becomes
  com.foo.bar:http:8983/to/index.html?a=b

 This was clear and is the main point to convey :-) I would in fact
 like loads more info on the WebPage object.


WebPage contains all data we have for a URL. Think Content + Parse Text +
Parse Data +
Crawl Datum + Outlinks + Inlinks...


   If URLs are stored lexicographically, this means that URLs from same
 domain
  and host are stored closer together. This will hopefully make developing
  statistics tools easier for hosts and domains.


 I am unconvinced by this. Yes we want host urls together so that we
 can easily do polite fetching from individual hosts. But would it make
 statistics tools easier? Maybe i don't know enough about hbase to be
 sure.


This is not about polite fetching. Let's say you want to count the number
of fetched URLs from host foo.com. All you would have to do is execute a
scan (in hbase lingo; in gora these are called queries) between the start
of foo.com and the end of it. Since all URLs within a host are stored
together, you do not have to go over the entire table to compute these
statistics. Makes sense?
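
A hedged sketch of such a scan through Gora (query construction details and
the status field are assumptions based on the Gora API of the time):

// Count fetched pages for host foo.com: all of its keys share the
// reversed-host prefix "com.foo:", so one range scan covers the host.
Query<String, WebPage> query = store.newQuery();
query.setStartKey("com.foo:");
query.setEndKey("com.foo;");  // ';' sorts right after ':', closing the range
Result<String, WebPage> result = query.execute();
long fetched = 0;
while (result.next()) {
  WebPage page = result.get();
  if (page.getStatus() == CrawlStatus.STATUS_FETCHED) { // field name assumed
    fetched++;
  }
}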


  Writing a MapReduce job that uses Gora for storage does not take much
  effort.


 This was confusing me. I thought that using Gora meant that we were
 losing the benefits of hdfs. So if we run a map reduce job over
 machines which are also HBase nodes does their input come from the
 hbase data stored on those nodes to reduce internal network traffic?


IIRC, we do that in Gora already. But even if we don't (which means we
forgot to do it),
using Gora means that you deal with straightforward java objects and Gora
figures out
what to store and where. As I said, your data can also be in sql, cassandra,
etc.

I guess part of the confusion is that the project was called nutchbase
(hence, implying
it is about tying nutch into hbase). But it was just a stupid name I made up
:). 

Re: Nutchbase design doc

2010-07-03 Thread Mattmann, Chris A (388J)
Guys,

This sounds awesome. Even I could understand it, which is saying something! :)

My only question: why introduce a new data structure called “Markers” when all 
that seems to be is a Metadata object. Let’s use o.a.tika.metadata.Metadata to 
represent that? My only comment then would be, aren’t we still doing something 
you mentioned you wanted to get rid of below, where you said: “For example, 
during parsing we don't have access to a URL's fetch status. So we copy fetch 
status into content metadata.” Aren’t we just doing the same thing with Markers?

Cheers,
Chris





On 7/3/10 3:01 AM, Doğacan Güney doga...@gmail.com wrote:

Hello everyone,

I am attaching first draft of a complete nutchbase design document. There are 
parts missing and parts not yet explained clearly but I would like to get 
everyone's opinion on what they think so far. Please let me
know which parts are unclear, which parts make no sense etc, and I will improve 
the draft.

-

Nutchbase
=

1) Rationale

* All your data in a central location (at least, nutch gives you the illusion 
of a centralized storage)
* No more segment/crawldb/linkdb merges.
* No more missing data in a job. There are a lot of places where we copy data 
from one structure to another just so that it is available in a later job. For 
example, during parsing we don't have access to a URL's fetch status. So we 
copy fetch status into content metadata. This will no longer be necessary after 
nutchbase. When writing a job or a new plugin, programmer only needs to specify 
which fields she wants to read and they will be available to plugin / job.
* A much simpler data model. If you want to update a small part in a single 
record, now you have to write a MR job that reads the relevant directory, 
change the single record, remove old directory and rename new directory. With 
nutchbase, you can just update that record.

2) Design

As mentioned above, all data for a URL is stored in a WebPage object. This 
object is accessed by a key that is the reverse form of a URL. For example,

http://bar.foo.com:8983/to/index.html?a=b becomes 
com.foo.bar:http:8983/to/index.html?a=b

If URLs are stored lexicographically, this means that URLs from same domain and 
host are stored closer together. This will hopefully make developing statistics 
tools easier for hosts and domains.

Writing a MapReduce job that uses Gora for storage does not take much effort. 
There is a new class called StorageUtils that has a number of static methods to 
make setting mappers/reducers/etc easier. Here is an example (from 
GeneratorJob.java):

Job job = new NutchJob(getConf(), "generate: " + crawlId);
StorageUtils.initMapperJob(job, FIELDS, SelectorEntry.class, WebPage.class,
GeneratorMapper.class, URLPartitioner.class);
StorageUtils.initReducerJob(job, GeneratorReducer.class);

An important argument is the second argument to #initMapperJob. This specifies 
all the fields that this job will be reading. If plugins will run during a job 
(for example, during ParserJob, several plugins will be active), then before 
job is run, those plugins must be initialized and FieldPluggable#getFields must 
be called for all plugins to figure out which fields they want to read. During 
map or reduce phase, modifying WebPage object is as simple as using the 
built-in setters. All changes will be persisted.

Even though some of these objects are still there, most of the CrawlDatum, 
Content, ParseData or similar objects are removed (or are slated to be 
removed). In most cases, plugins will simply take (String key, WebPage page) as 
arguments and modify WebPage object in-place.

3) Jobs

Nutchbase uses the concept of a marker to identify what has been processed and 
what will be processed from now on. WebPage object contains a map<String,
String> called markers. For example, when GeneratorJob generates a URL, it puts
a unique string and a unique crawl id to this marker map. Then FetcherJob only 
fetches a given URL if it contains this marker. At the end of the crawl cycle, 
DbUpdaterJob clears all markers (except markers placed by IndexerJob).

a) InjectorJob: This phase consists of two different jobs. First job reads from 
a text file (as before) then puts a special inject marker. Second job goes over 
all URLs then if it finds a WebPage with inject marker but nothing else (so, it 
is a new URL), then this URL is injected (and marker is deleted). Otherwise, 
marker is just deleted (since URL is already injected).

b) GeneratorJob: GeneratorJob is similar to what it was before, but it is now a 
single job. During map phase:

if FetchSchedule indicates given URL is to be fetched then
    Calculate generator score using scoring filters
    Output <SelectorEntry(URL, score), WebPage>

SelectorEntry is sorted according to given score so highest scoring entries
will be processed first during reduce.

Re: Nutchbase design doc

2010-07-03 Thread Doğacan Güney
Hi Chris,

On Sat, Jul 3, 2010 at 18:35, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

  Guys,

 This sounds awesome. Even I could understand it, which is saying something!
 :)

 My only question: why introduce a new data structure called “Markers” when
 all that seems to be is a Metadata object. Let’s use
 o.a.tika.metadata.Metadata to represent that? My only comment then would be,
 aren’t we still doing something you mentioned you wanted to get rid of
 below, where you said: “For example, during parsing we don't have access to
 a URL's fetch status. So we copy fetch status into content metadata.” Aren’t
 we just doing the same thing with Markers?


Actually, markers used to be stored in the metadata object in WebPage
(metadata is a map from string to bytes). It just seemed clearer to me to
put it into its own field. We can discuss if moving it back into metadata
makes more sense.

One thing: We can't use tika's metadata object as WebPage object is
generated from an avro schema.

As for your last comment: Markers are only used to identify where we are in
a crawl cycle and the individual crawl ids. So during parse, when we get a
URL during MapReduce, parse can easily check if that URL has been fetched in
*that* crawl cycle (since there is no point in parsing it if it hasn't been
fetched). So it is not used to pass any important information around. It is
just a simple tracking system. Did this make it any clearer?


 Cheers,
 Chris






 On 7/3/10 3:01 AM, Doğacan Güney doga...@gmail.com wrote:

 Hello everyone,

 I am attaching first draft of a complete nutchbase design document. There
 are parts missing and parts not yet explained clearly but I would like to
 get everyone's opinion on what they think so far. Please let me
 know which parts are unclear, which parts make no sense etc, and I will
 improve the draft.


 -

 Nutchbase
 =

 1) Rationale

 * All your data in a central location (at least, nutch gives you the
 illusion of a centralized storage)
 * No more segment/crawldb/linkdb merges.
 * No more missing data in a job. There are a lot of places where we copy
 data from one structure to another just so that it is available in a later
 job. For example, during parsing we don't have access to a URL's fetch
 status. So we copy fetch status into content metadata. This will no longer
 be necessary after nutchbase. When writing a job or a new plugin, programmer
 only needs to specify which fields she wants to read and they will be
 available to plugin / job.
 * A much simpler data model. If you want to update a small part in a single
 record, now you have to write a MR job that reads the relevant directory,
 change the single record, remove old directory and rename new directory.
 With nutchbase, you can just update that record.

 2) Design

 As mentioned above, all data for a URL is stored in a WebPage object. This
 object is accessed by a key that is the reverse form of a URL. For example,

 http://bar.foo.com:8983/to/index.html?a=b becomes
 com.foo.bar:http:8983/to/index.html?a=b

 If URLs are stored lexicographically, this means that URLs from same domain
 and host are stored closer together. This will hopefully make developing
 statistics tools easier for hosts and domains.

 Writing a MapReduce job that uses Gora for storage does not take much
 effort. There is a new class called StorageUtils that has a number of static
 methods to make setting mappers/reducers/etc easier. Here is an example
 (from GeneratorJob.java):

 Job job = new NutchJob(getConf(), "generate: " + crawlId);
 StorageUtils.initMapperJob(job, FIELDS, SelectorEntry.class,
 WebPage.class,
 GeneratorMapper.class, URLPartitioner.class);
 StorageUtils.initReducerJob(job, GeneratorReducer.class);

 An important argument is the second argument to #initMapperJob. This
 specifies all the fields that this job will be reading. If plugins will run
 during a job (for example, during ParserJob, several plugins will be
 active), then before job is run, those plugins must be initialized and
 FieldPluggable#getFields must be called for all plugins to figure out which
 fields they want to read. During map or reduce phase, modifying WebPage
 object is as simple as using the built-in setters. All changes will be
 persisted.

 Even though some of these objects are still there, most of the CrawlDatum,
 Content, ParseData or similar objects are removed (or are slated to be
 removed). In most cases, plugins will simply take (String key, WebPage page)
 as arguments and modify WebPage object in-place.

 3) Jobs

 Nutchbase uses the concept of a marker to identify what has been processed
 and what will be processed from now on. WebPage object contains a
 map<String, String> called markers. For example, when GeneratorJob generates
 a URL, it puts a unique string and a unique crawl id to this marker map.

[jira] Updated: (NUTCH-838) Add timing information to all Tool classes

2010-07-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-838:


Fix Version/s: 1.2

I'll backport this to the 1.2 branch as well.

 Add timing information to all Tool classes
 --

 Key: NUTCH-838
 URL: https://issues.apache.org/jira/browse/NUTCH-838
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator, indexer, linkdb, parser
Affects Versions: 1.1
 Environment: JDK 1.6, Linux & Windows
Reporter: Jeroen van Vianen
Assignee: Chris A. Mattmann
 Fix For: 1.2, 2.0

 Attachments: timings.patch


 Am happily trying to crawl a few hundred URLs incrementally. Performance is 
 degrading suddenly after the index reaches approximately 25000 URLs.
 At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
 solrindex, solrdedup batch takes approximately half an hour with topN 500, 
 but elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. 
 As I'm uncertain which of the phases takes so much time I decided to add 
 start and finish times to all classes that implement Tool so I at least have a 
 feeling and can review them in a log file.
 Am using pretty old hardware, but I am planning to recrawl these URLs on a 
 regular basis and if every iteration is going to take more and more time, 
 index updates will be few and far between :-(
 I added timing information to *all* Tool classes for consistency whereas 
 there are only 10 or so Tools that are really interesting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-838) Add timing information to all Tool classes

2010-07-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-838.
-

Resolution: Fixed

- Patch applied to trunk in r960246 and backported to 1.2-branch in r960248. I 
had to make some minor CR-LF mods and avoid patching a few files that were 
removed in the latest trunk. Thanks, Jeroen!

 Add timing information to all Tool classes
 --

 Key: NUTCH-838
 URL: https://issues.apache.org/jira/browse/NUTCH-838
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator, indexer, linkdb, parser
Affects Versions: 1.1
 Environment: JDK 1.6, Linux & Windows
Reporter: Jeroen van Vianen
Assignee: Chris A. Mattmann
 Fix For: 1.2, 2.0

 Attachments: timings.patch


 Am happily trying to crawl a few hundred URLs incrementally. Performance is 
 degrading suddenly after the index reaches approximately 25000 URLs.
 At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
 solrindex, solrdedup batch takes approximately half an hour with topN 500, 
 but elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. 
 As I'm uncertain which of the phases takes so much time I decided to add 
 start and finish times to all classes that implement Tool so I at least have a 
 feeling and can review them in a log file.
 Am using pretty old hardware, but I am planning to recrawl these URLs on a 
 regular basis and if every iteration is going to take more and more time, 
 index updates will be few and far between :-(
 I added timing information to *all* Tool classes for consistency whereas 
 there are only 10 or so Tools that are really interesting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



YCSB benchmark for KV stores

2010-07-03 Thread Andrzej Bialecki

Hi,

Found this link:

http://wiki.github.com/brianfrankcooper/YCSB/papers-and-presentations

Would be cool to run the benchmark for the same stores but via Gora.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Hudson build is back to normal : Nutch-trunk #1197

2010-07-03 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/changes




[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884996#action_12884996
 ] 

Hudson commented on NUTCH-837:
--

Integrated in Nutch-trunk #1197 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])


 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the: 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities: ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-838) Add timing information to all Tool classes

2010-07-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884997#action_12884997
 ] 

Hudson commented on NUTCH-838:
--

Integrated in Nutch-trunk #1197 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])
- fix for NUTCH-838 Add timing information to all Tool classes


 Add timing information to all Tool classes
 --

 Key: NUTCH-838
 URL: https://issues.apache.org/jira/browse/NUTCH-838
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator, indexer, linkdb, parser
Affects Versions: 1.1
 Environment: JDK 1.6, Linux & Windows
Reporter: Jeroen van Vianen
Assignee: Chris A. Mattmann
 Fix For: 1.2, 2.0

 Attachments: timings.patch


 Am happily trying to crawl a few hundred URLs incrementally. Performance is 
 degrading suddenly after the index reaches approximately 25000 URLs.
 At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
 solrindex, solrdedup batch takes approximately half an hour with topN 500, 
 but elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. 
 As I'm uncertain which of the phases takes so much time I decided to add 
 start and finish times to all classes that implement Tool so I at least have a 
 feeling and can review them in a log file.
 Am using pretty old hardware, but I am planning to recrawl these URLs on a 
 regular basis and if every iteration is going to take more and more time, 
 index updates will be few and far between :-(
 I added timing information to *all* Tool classes for consistency whereas 
 there are only 10 or so Tools that are really interesting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.