Re: Nutch robot hitting our Web servers
Hi,

Thank you for the email. Can you provide some more information? For example, how many requests does the bot make per second, does it respect robots.txt, etc.?

On Mon, Dec 13, 2010 at 11:28, Chrislip, Ric chrisl...@hartwick.edu wrote:

> For several days now a Nutch robot from IP 174.36.195.29 has been hitting our run-time Web servers. I noticed because our event logs are showing many ASP.NET warnings about illegal characters in path. Your Web page at http://nutch.apache.org/bot.htm says that you would like to hear about any bad behavior. I have attached today's log entries from that IP address on one of our servers.
>
> Ric Chrislip
> Senior Programmer/Analyst, E-mail Administrator
> Clark Hall 111, Hartwick College, Oneonta, New York, USA
> 607-431-4189

--
Doğacan Güney
Re: [Nutchbase] Multi-value ParseResult missing
Hey,

On Thu, Jul 22, 2010 at 00:47, Andrzej Bialecki a...@getopt.org wrote:

> Hi,
>
> I noticed that nutchbase doesn't use the multi-valued ParseResult; instead all parse plugins return a simple Parse. As a consequence, it's not possible to return multiple values from parsing a single WebPage, something that parsers for compound documents absolutely require (archives, rss, mbox, etc). Dogacan - was there a particular reason for this change?

No. Even though I wrote most of the original ParseResult code, I couldn't wrap my head around how to update the WebPage (or old TableRow) API to use ParseResult.

> However, a broader issue here is how to treat compound documents, and links to/from them:
>
> a) record all URLs of child documents (e.g. with the !/ notation, or # notation), and create as many WebPage-s as there were archive members. This needs some hacks to prevent such urls from being scheduled for fetching.
>
> b) extend WebPage to allow for multiple content sections and their names (and metadata, and ... yuck)
>
> c) like a) except put a special synthetic mark on the page to prevent selection of this page for generation and fetching. This mark would also help us to update / remove obsolete sub-documents when their container changes.
>
> I'm leaning towards c).

I was initially leaning towards (a) but I think (c) sounds good too. The nice thing about (c) is that these documents will correctly get inlinks (assuming the URL given to them makes sense - for an RSS feed, I am thinking this would be the link element), etc. Though this can also be a problem: in some instances you may want to refetch a URL that happens to be a link in a feed.

> Now, when it comes to the ParseResult ... it's not an ideal solution either, because it means we have to keep all sub-document results in memory. We could avoid it by implementing something that Aperture uses, which is a sub-crawler - a concept of a parser plugin for compound formats. The main plugin would return a special result code, which basically says "this is a compound format of type X", and then the caller (ParseUtil?) would use SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for the container. This parser in turn would simply extract sections of the compound document (as streams) and it would pass each stream to the regular parsing chain. The caller then needs to iterate over results returned from the SubCrawler. What do you think?

This is excellent :) +1.

> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com - Contact: info at sigram dot com

--
Doğacan Güney
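For the record, the sub-crawler idea might look roughly like this. SubCrawlerFactory.create() comes from the proposal above; the Section type and every method signature here are made up for illustration, not taken from Aperture or nutchbase:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;

    /** One member of a compound document (archive entry, feed item, ...). */
    class Section {
      final String name;       // e.g. "foo.zip!/readme.txt"
      final InputStream data;  // raw bytes, fed back to the regular parse chain
      Section(String name, InputStream data) { this.name = name; this.data = data; }
    }

    /** Extracts sections of a compound format without parsing them itself. */
    interface SubCrawler {
      Iterator<Section> sections() throws IOException;
    }

    class SubCrawlerFactory {
      /** Look up a SubCrawler registered for mime type typeX (illustration only). */
      static SubCrawler create(String typeX, InputStream containerDataStream) {
        throw new UnsupportedOperationException("illustration only");
      }
    }

    // The caller (ParseUtil?) would then do something like:
    //   SubCrawler sub = SubCrawlerFactory.create(typeX, containerDataStream);
    //   for (Iterator<Section> it = sub.sections(); it.hasNext(); ) {
    //     Section s = it.next();
    //     // pass s.data to the regular parsing chain, keyed by s.name
    //   }

The point of the shape: only one section's stream is open at a time, so sub-document results never all sit in memory the way a multi-valued ParseResult forces them to.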
Merging in nutchbase
Hey everyone,

I would like to start merging nutchbase into trunk, so I am hoping to get everyone's comments and suggestions on how to do that.

Some of the other changes in nutchbase (such as deleting nutch's own indexing system) have already been incorporated in nutch trunk, so I think the difference between nutchbase and nutch trunk has been reduced to the scope of NUTCH-650 and NUTCH-811, i.e., abstracting storage away from nutch.

Unfortunately, AFAICS, there is no easy way to separate NUTCH-650 into smaller patches. All nutch jobs and all plugins have to be updated to use the new (String, WebPage) API, and it needs to be done all at once. So if no one has any objections, I want to create a gigantic patch that applies to current trunk, attach it to NUTCH-650, and commit it soon. (I want to do this quickly because nutch development speed is picking up again, and I am worried that issues like NUTCH-843, while making perfect sense, will wreak havoc on my merging efforts :))

What does everyone think?

--
Doğacan Güney
Re: Merging in nutchbase
Hey everyone,

On Sat, Jul 10, 2010 at 17:43, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

> Hey Guys,
>
> +1 to Andrzej’s suggestion. I mostly run small scale stuff with Nutch, so unless I can run HBase in small scale (or better yet, an embedded SQL db), I won’t be as much use! :)

I just want to make clear that this is, indeed, a goal I share. Gora already has an SQL backend that can use embedded hsqldb. There are still some weird bugs (I really hate SQL :), but once I am done fixing them all (which I will be doing today and tomorrow), nutch will run on gora with embedded hsqldb with zero configuration.

> Cheers,
> Chris
>
> On 7/10/10 7:28 AM, Andrzej Bialecki a...@getopt.org wrote:
>
>> On 2010-07-10 15:24, Julien Nioche wrote:
>>
>>> I agree with Andrzej that the SQL backend has to be checked and tested on nutchbase before we can start porting it to the trunk. Moreover I have raised an important design issue on the list recently (table per fetchround) which needs some changes to Gora first and must be discussed, implemented and tested in NutchBase before we port it to trunk
>>
>> This could go either way, whichever is more convenient - I don't see it as something to necessarily withhold the merge. Without the first issue, though, we lose the ability to develop, test and run in local mode...
>>
>> --
>> Best regards,
>> Andrzej Bialecki
>> http://www.sigram.com - Contact: info at sigram dot com
>
> Chris Mattmann, Ph.D.
> Senior Computer Scientist, NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> Email: chris.mattm...@jpl.nasa.gov  WWW: http://sunset.usc.edu/~mattmann/
> Adjunct Assistant Professor, Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA

--
Doğacan Güney
Re: Merging in nutchbase
On Sat, Jul 10, 2010 at 17:28, Andrzej Bialecki a...@getopt.org wrote:

> On 2010-07-10 15:24, Julien Nioche wrote:
>
>> I agree with Andrzej that the SQL backend has to be checked and tested on nutchbase before we can start porting it to the trunk. Moreover I have raised an important design issue on the list recently (table per fetchround) which needs some changes to Gora first and must be discussed, implemented and tested in NutchBase before we port it to trunk
>
> This could go either way, whichever is more convenient - I don't see it as something to necessarily withhold the merge. Without the first issue, though, we lose the ability to develop, test and run in local mode...

While I agree with the table-per-fetchround issue, I would like to postpone it until after the merge. This issue is tricky for a couple of reasons. For example, AFAIK, cassandra's latest released version does not support live schema updates, so you cannot add/delete tables on a running cassandra machine. I guess we can use super columns as our tables, then use columns to store data, but that may be sub-optimal.

For SQL, as mentioned below, it is almost done. There is a weird bug where I do not read back what I just wrote. Once I figure out what's wrong, I think it will be good to go.

> --
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com - Contact: info at sigram dot com

--
Doğacan Güney
Re: [Nutchbase] WebPage class is a generated code?
Hey,

On Fri, Jul 2, 2010 at 17:26, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

> Hey Guys,
>
> Since they are generated, +1 to:
> - adding a filepattern to svn:ignore to ignore them
> - updating build.xml to autogenerate

I created NUTCH-842 to track this problem.

> Cheers,
> Chris
>
> On 7/2/10 3:24 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
>
>> (This question is mostly for Dogacan and Enis, but I encourage anyone familiar with the code to join the threads with [Nutchbase] - the sooner the better ;) )
>>
>> I'm looking at src/gora/webpage.avsc and WebPage.java & friends... presumably the java code was autogenerated from avsc using Gora? If so, we should put this autogeneration step in our build.xml. Or am I missing something?

Correct.

>> If we keep the generated java classes in svn then we probably want to make this task optional, i.e., it would not be done as part of the build tasks; OR we can add it to the build but remove it from svn (or better, add it to svn:ignore or whatever-it-is-called).
>>
>> J.
>
> Chris Mattmann, Ph.D.
> Senior Computer Scientist, NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
> Email: chris.mattm...@jpl.nasa.gov  WWW: http://sunset.usc.edu/~mattmann/
> Adjunct Assistant Professor, Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA

--
Doğacan Güney
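For reference, the build.xml hook could look something like the target below. Whether Gora's schema compiler class is really named org.apache.gora.compiler.GoraCompiler, and what arguments it takes, are assumptions here rather than the task that eventually went into NUTCH-842:

    <!-- Hypothetical target: regenerate the Gora data beans from the Avro
         schema before compiling. Compiler class name and arguments are
         assumptions, not the committed NUTCH-842 task. -->
    <target name="generate-gora-src">
      <java classname="org.apache.gora.compiler.GoraCompiler"
            classpathref="classpath" fork="true" failonerror="true">
        <arg value="src/gora/webpage.avsc"/>
        <arg value="src/java"/> <!-- generated WebPage.java lands here -->
      </java>
    </target>

With something like this wired into the compile target's depends list, the generated classes can be dropped from svn (or svn:ignore'd) as Chris suggests.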
Minimizing the number of stored fields for Solr
Hey everyone,

This is not really a proposition but rather something I have been wondering about for a while, so I wanted to see what everyone is thinking.

Currently in our solr backend, we have stored=true/indexed=false fields and stored=true/indexed=true fields. The former class of fields is mostly used for storing the digest, caching information, etc. I suggest that we get rid of all indexed=false fields and read all such data from the storage backend instead.

For the latter class of fields (i.e., stored=true/indexed=true), I suggest that we set them to stored=false for everything but the id field. As an example, currently title is stored and indexed in solr while text is only indexed (and thus needs to be fetched from the storage backend). But for the hbase backend, title and text are already stored close together (in the same column family), so the performance hit of reading just text or reading both will likely be the same. And removing storage from solr may lead to better caching of indexed fields and better performance overall.

What does everyone think?

--
Doğacan Güney
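To make the read path concrete: with only the id field stored in solr, a field like text would be read back through Gora at serving time. A minimal sketch, assuming an already-open DataStore<String, WebPage>; package and class names follow later Apache Gora / Nutch 2.x releases and are assumptions here, not a fixed API:

    import org.apache.gora.store.DataStore;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.TableUtil;

    // Sketch: a solr hit returns only the id (the reversed URL); the
    // unstored 'text' field is then fetched from the storage backend.
    public class StoredFieldLookup {
      private final DataStore<String, WebPage> store;

      public StoredFieldLookup(DataStore<String, WebPage> store) {
        this.store = store;
      }

      /** Read only the 'text' column for a hit instead of storing it in solr. */
      public String textFor(String url) throws Exception {
        String key = TableUtil.reverseUrl(url);            // e.g. com.foo.bar:http/...
        WebPage page = store.get(key, new String[] { "text" });
        if (page == null) return null;
        CharSequence text = page.getText();
        return text == null ? null : text.toString();
      }
    }

Passing the field list to get() is what keeps this cheap: the backend only has to materialize the requested columns, which is why reading text alone costs about the same as reading title and text together from one column family.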
Nutchbase design doc
generator markers). GeneratorJob will also print the crawl id to the console.

c) FetcherJob: During the map phase, FetcherJob outputs <Integer, FetcherEntry(key, WebPage)> pairs. The first member of the pair (the integer) is a random integer between 0 and 65536. After map, these pairs are partitioned into reduces according to the URLs' hosts. Reduce works exactly like the old Fetcher's map. FetcherJob can also continue interrupted fetches now (by giving the -continue switch on the command line).

The random integer may seem pointless but it is actually quite important for performance. Let's say that instead of a random integer we were to just output <key, WebPage> pairs from map. Again, let's say you have 1000 URLs from host a.com and 500 URLs from b.com. Let's also say that we have 10 fetcher threads, so the maximum queue size is 500. In this case, since all URLs from host a.com will be processed by reduce before all URLs from b.com, during the reduce phase only one thread will fetch URLs from a.com while every other thread will be spin-waiting. However, with randomization, URLs from a.com and b.com will be processed in a random order, so bandwidth utilization will be higher.

d) ParserJob: ParserJob is straightforward. It is only a map (i.e., has 0 reducers). It simply parses all URLs with the active parse plugins.

e) DbUpdaterJob: This is a combination of the updatedb and invertlinks jobs. If a URL is successfully parsed (which means it will contain a parse marker), DbUpdaterJob will put its own marker. Note: It may make more sense to put a marker even if a URL is not successfully parsed. DbUpdaterJob also cleans all other markers.

f) IndexerJob: Goes over all URLs with a db update marker (again, you can specify ALL URLs with update markers, or a crawl id), and indexes them.

4) What's missing

Most of the core functionality and plugins have been ported. However, some tools and features are still missing: arc segment tools, PageRank scoring, field indexing API, etc.

--
Doğacan Güney
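The randomization trick in (c) is just a salted shuffle key. A minimal sketch of the map side, assuming a FetcherEntry wrapper around (key, WebPage) as described above; the class names are approximate, not the exact nutchbase classes:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map side of the fetch job as described above: emit a random shuffle key
    // so URLs from one big host don't arrive at a reducer as one long run.
    // WebPage and FetcherEntry are assumed to come from the nutchbase code;
    // FetcherEntry must be Writable so it can cross the shuffle.
    public class FetcherMapper
        extends Mapper<String, WebPage, IntWritable, FetcherEntry> {

      private final Random random = new Random();

      @Override
      protected void map(String key, WebPage page, Context context)
          throws IOException, InterruptedException {
        // The partitioner still groups by host; the random key only
        // randomizes the *order* within each reducer's input.
        context.write(new IntWritable(random.nextInt(65536)),
                      new FetcherEntry(key, page));
      }
    }

Since partitioning is by host, politeness is unaffected; only the interleaving of hosts within a reduce changes, which is exactly what keeps all 10 threads busy in the a.com/b.com example.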
Re: Nutchbase design doc
Hi Alex,

On Sat, Jul 3, 2010 at 14:45, Alex McLintock alex.mclint...@gmail.com wrote:

> Doğacan
>
> 2010/7/3 Doğacan Güney doga...@gmail.com:
>> I am attaching the first draft of a complete nutchbase design document. There are parts missing and parts not yet explained clearly but I would like to get everyone's opinion on what they think so far.
>
> Thanks. I read your design and found it quite clear - at least to this non-committer. :-) I would suggest that we should take this opportunity to do a full design document including design which has not changed from v1 to v2. So more please! I have made the odd comment which was then explained later on. I guess that meant I was a bit confused :-)

Thanks for the excellent comments. I will try to explain as best as I can. Please let me know which parts are unclear, which parts make no sense, etc., and I will improve the draft.

> The main thing I missed was any kind of overview of the data flow. I'd like to see a description of how a url/webpage goes through our system from being unknown, injected or discovered, to being queued for fetching in the generator, to fetched, parsed, fetchscheduled, scored, and generated again. Plus of course, indexed by sending to solr and being seen by an end user application. At each stage I'd like to see where the data is stored (text file, hbase, solr) and especially how this differs from the previous (text file, nutch crawldb, solr). I know that some of this may sound like a tutorial, but it is worth doing now rather than putting it off until later.

One of the things nutchbase attempts to do is hide all the complexity of managing individual segments, crawl/link/whatever dbs from the user. Now, nutch delegates all storage handling to Gora (http://github.com/enis/gora). What Gora does is give you a key-value store (in this case, keys are reversed URLs, values are WebPage objects), and you do all your work through these objects. So storage will not be an issue for you. Right now, Gora (and thus nutch) supports storing your data in hbase and sql (with cassandra and other backends coming soon). So with nutch and gora, you will start up your hbase/sql/cassandra/etc server(s), then nutch will figure out what to store and where.

>> Nutchbase
>>
>> 1) Rationale
>>
>> * All your data in a central location (at least, nutch gives you the illusion of a centralized storage)
>
> But hbase is distributed across your hadoop cluster, right? This is the illusion you meant.

Yes. Also, cassandra will be distributed too. Maybe in the future someone will write an HDFS-backed backend for Gora; then your data will actually live in separate files, but will still look like one centralized storage to you.

>> 2) Design
>>
>> As mentioned above, all data for a URL is stored in a WebPage object. This object is accessed by a key that is the reverse form of a URL. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b
>
> This was clear and is the main point to convey :-) I would in fact like loads more info on the WebPage object.

WebPage contains all data we have for a URL. Think Content + Parse Text + Parse Data + Crawl Datum + Outlinks + Inlinks...

>> If URLs are stored lexicographically, this means that URLs from the same domain and host are stored closer together. This will hopefully make developing statistics tools easier for hosts and domains.
>
> I am unconvinced by this. Yes we want host urls together so that we can easily do polite fetching from individual hosts. But would it make statistics tools easier? Maybe I don't know enough about hbase to be sure.

This is not about polite fetching. Let's say you want to count the number of fetched URLs from host foo.com. All you would have to do is execute a scan (in hbase lingo; in gora these are called queries) between the start of foo.com and the end of it. Since all URLs within a host are stored together, you do not have to go over the entire table to compute these statistics. Makes sense?

>> Writing a MapReduce job that uses Gora for storage does not take much effort.
>
> This was confusing me. I thought that using Gora meant that we were losing the benefits of hdfs. So if we run a map reduce job over machines which are also HBase nodes, does their input come from the hbase data stored on those nodes to reduce internal network traffic?

IIRC, we do that in Gora already. But even if we don't (which means we forgot to do it), using Gora means that you deal with straightforward java objects and Gora figures out what to store and where. As I said, your data can also be in sql, cassandra, etc. I guess part of the confusion is that the project was called nutchbase (hence implying it is about tying nutch into hbase). But it was just a stupid name I made up.
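That host scan is easiest to see as a key-range query. A sketch against the query API of later Apache Gora releases (setStartKey/setEndKey are assumptions about the exact names at the time); the key bounds are deliberately simplified, and a real tool would tighten them, since as written a host like foobar.com (key prefix com.foobar) would also fall inside the range:

    import org.apache.gora.query.Query;
    import org.apache.gora.query.Result;
    import org.apache.gora.store.DataStore;
    import org.apache.nutch.storage.WebPage;

    // Count pages for foo.com by scanning only its slice of the key space.
    // Keys are reversed URLs, so every com.foo.* row is contiguous.
    public class HostStats {
      public static long countPages(DataStore<String, WebPage> store)
          throws Exception {
        Query<String, WebPage> query = store.newQuery();
        query.setStartKey("com.foo");       // first possible key for foo.com
        query.setEndKey("com.foo\uffff");   // just past the last one (simplified)
        long count = 0;
        Result<String, WebPage> result = query.execute();
        while (result.next()) {
          count++;
        }
        result.close();
        return count;
      }
    }

The whole point is in the two set*Key calls: the backend touches only the rows in that range instead of the full table.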
Re: Nutchbase design doc
Hi Chris,

On Sat, Jul 3, 2010 at 18:35, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

> Guys,
>
> This sounds awesome. Even I could understand it, which is saying something! :) My only question: why introduce a new data structure called “Markers” when all that seems to be is a Metadata object. Let’s use o.a.tika.metadata.Metadata to represent that? My only comment then would be, aren’t we still doing something you mentioned you wanted to get rid of below, where you said: “For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata.” Aren’t we just doing the same thing with Markers?

Actually, markers used to be stored in the metadata object in WebPage (metadata is a map from string to bytes). It just seemed clearer to me to put them into their own field. We can discuss whether moving them back into metadata makes more sense. One thing: we can't use tika's Metadata object, as the WebPage object is generated from an avro schema.

As for your last comment: markers are only used to identify where we are in a crawl cycle, and the individual crawl ids. So during parse, when we get a URL during MapReduce, parse can easily check if that URL has been fetched in *that* crawl cycle (since there is no point in parsing it if it hasn't been fetched). So it is not used to pass any important information around. It is just a simple tracking system. Did this make it any clearer?

> Cheers,
> Chris
>
> On 7/3/10 3:01 AM, Doğacan Güney doga...@gmail.com wrote:
>
>> Hello everyone,
>>
>> I am attaching the first draft of a complete nutchbase design document. There are parts missing and parts not yet explained clearly, but I would like to get everyone's opinion on what they think so far. Please let me know which parts are unclear, which parts make no sense, etc., and I will improve the draft.
>>
>> Nutchbase
>>
>> 1) Rationale
>>
>> * All your data in a central location (at least, nutch gives you the illusion of a centralized storage)
>> * No more segment/crawldb/linkdb merges.
>> * No more missing data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary after nutchbase. When writing a job or a new plugin, the programmer only needs to specify which fields she wants to read and they will be available to the plugin / job.
>> * A much simpler data model. If you want to update a small part in a single record, now you have to write a MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new directory. With nutchbase, you can just update that record.
>>
>> 2) Design
>>
>> As mentioned above, all data for a URL is stored in a WebPage object. This object is accessed by a key that is the reverse form of a URL. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b
>>
>> If URLs are stored lexicographically, this means that URLs from the same domain and host are stored closer together. This will hopefully make developing statistics tools easier for hosts and domains.
>>
>> Writing a MapReduce job that uses Gora for storage does not take much effort. There is a new class called StorageUtils that has a number of static methods to make setting mappers/reducers/etc easier. Here is an example (from GeneratorJob.java):
>>
>>   Job job = new NutchJob(getConf(), "generate: " + crawlId);
>>   StorageUtils.initMapperJob(job, FIELDS, SelectorEntry.class,
>>       WebPage.class, GeneratorMapper.class, URLPartitioner.class);
>>   StorageUtils.initReducerJob(job, GeneratorReducer.class);
>>
>> An important argument is the second argument to #initMapperJob. This specifies all the fields that this job will be reading. If plugins will run during a job (for example, during ParserJob, several plugins will be active), then before the job is run, those plugins must be initialized and FieldPluggable#getFields must be called for all plugins to figure out which fields they want to read.
>>
>> During the map or reduce phase, modifying a WebPage object is as simple as using the built-in setters. All changes will be persisted. Even though some of these objects are still there, most of the CrawlDatum, Content, ParseData or similar objects are removed (or are slated to be removed). In most cases, plugins will simply take (String key, WebPage page) as arguments and modify the WebPage object in-place.
>>
>> 3) Jobs
>>
>> Nutchbase uses the concept of a marker to identify what has been processed and what will be processed from now on. The WebPage object contains a map<String, String> called markers. For example, when GeneratorJob generates a URL, it puts a unique string and a unique
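The fetch-marker check during parse then reduces to a map lookup. A sketch; the marker key string and the exact map types here are assumptions about the nutchbase code, not taken from it:

    import org.apache.avro.util.Utf8;
    import org.apache.nutch.storage.WebPage;

    // Sketch of the marker protocol described above: parse only touches rows
    // that carry this cycle's fetch marker. "_ftcmrk_" is an assumed key name.
    public class Markers {
      private static final Utf8 FETCH_MARK = new Utf8("_ftcmrk_");

      /** True if this page was fetched in the given crawl cycle. */
      public static boolean fetchedIn(WebPage page, String crawlId) {
        Utf8 value = page.getMarkers().get(FETCH_MARK); // the map<string,string> field
        return value != null && value.toString().equals(crawlId);
      }
    }

Because the marker value is the crawl id itself, the same lookup distinguishes "fetched ever" from "fetched in *this* cycle", which is all the tracking the jobs need.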
Re: Nutch 2.0
Hi,

On Tue, Jun 29, 2010 at 11:49, Julien Nioche lists.digitalpeb...@gmail.com wrote:

> Thanks Chris,
>
> I already shared my thoughts on this yesterday, but I still fail to see the advantage of keeping the details of the recent github nutchbase commits (some of them being just upgrades to the recent changes in 1.1) in svn nutchbase, knowing that the point is actually to do incremental changes to the existing trunk (which already has the 1.1 changes) from svn nutchbase and review / comment / improve the code on this occasion. Since we also want to produce a patch in JIRA for the changes in svn nutchbase in order to put the "donated to Apache" stamp on it, it would make sense to do that just once and not for all the commits which have been done in github.
>
> I am probably missing an important point here, but if so I would appreciate it if someone (Dogacan?) could explain why we should not stick to the original plan: (a) clear the existing svn nutchbase, (b) generate a large patch with the code from github and JIRA it,

Do you mean generating a single patch vs nutch? There are a lot of fixes and improvements in nutch 1.1 that I cherry-picked to nutchbase later. If we generate a larger patch, and then this branch is blessed as trunk, then history for those improvements will be lost. Or am I misunderstanding you here?

> (c) commit the changes to svn nutchbase, then get on with the interesting bits. My concern is that proceeding as Dogacan described yesterday might take quite some time and block the rest of the work on 2.0. I am happy to work on the 3 steps above BTW.
>
> Thanks
> Julien
>
> On 29 June 2010 06:44, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>
>> Okey dokey guys, (c), (e) and (g) are done. Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e) and (f)...
>>
>> Cheers,
>> Chris
>>
>> On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:
>>
>>> On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
>>>
>>>> On 2010-06-28 17:57, Mattmann, Chris A (388J) wrote:
>>>>
>>>>> Hi Doğacan, So your proposition is to combine (a) and (b) then? That’s fine by me, so long as there are no objections from others. I can still move forward with (c), (e) and (g) then...
>>>>
>>>> No objections from me - but IMHO to satisfy the legal minds you still need to produce a patch and attach it to an issue with the "Grant to ASF" checkbox marked...
>>>
>>> OK, I'll create a new issue in JIRA, and then attach a lot of patches :) I'll try to appropriately mark patches that are straightforward ports from nutch 1.1 into nutchbase so that the same committers can commit those patches _again_, hopefully preserving post nutch 1.0 history as much as possible.
>>>
>>>> (Also, I always shudder when I imagine a massive merge failing ... but that's probably a leftover from my CVS days, when a failed merge would leave a completely broken tree.. ah, well, good luck :) )
>>>
>>> I regularly do large merges in git and it works beautifully. We'll see how well SVN does :)
>>
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist, NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
>> Email: chris.mattm...@jpl.nasa.gov  WWW: http://sunset.usc.edu/~mattmann/
>> Adjunct Assistant Professor, Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
Doğacan Güney
Re: Nutch 2.0
Hey all,

I will double check to make sure, but IIRC, there is no need to delete svn nutchbase since the current code on github simply builds on top of that. So why not simply merge the github branch into svn? It will be a clean merge... The only problem is that contributor info is messed up in github, but I tried to preserve as much contrib info as possible when I pulled in the 1.1 changes (via git cherry-pick). So we can break the code in github into smaller patches, apply them on top of svn nutchbase (which, again, will be clean), then the 1.1 changes can be applied by the _original_ committers, thus hopefully preserving contributor info as well. Makes sense?

On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com wrote:

> Hi,
>
>> (a) deleting svn:nutchbase
>> (b) svn:importing Git Nutchbase
>> (c) branch current 1.2-trunk as 1.2-branch
>> (d) iteratively apply patches from new svn:nutchbase to trunk to bring it up to snuff
>> (e) roll the version # in nutch trunk to 2.0-dev
>> (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where it makes sense
>> (g) a 2.1 version is added to mark anything that we don't want in 2.0 and we file post-2.0 issues there
>> (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is removed. All unit tests should pass regression where it makes sense
>> (i) Nutch documentation is brought up to date on wiki and checked into SVN
>> (j) We roll a 2.0 release
>
> +1
>
>> I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to participate in (d) and (f). I'm thinking Julien and Doğacan would be the best people to do (b) and (i). Doğacan is in the process of writing the documentation. (h) should be a result of all steps prior, (a)-(g), and as for (j), I'd be happy to do (j) when the time comes. So, if I don't hear any objections, I'll do (a), (c), (e) and (g) tomorrow... (6/28, likely PM PST Los Angeles time)
>
> cool, thanks
>
> J.
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
Doğacan Güney