Re: hits.getTotal()

2005-07-07 Thread Doug Cutting
Ilia S. Yatsenko wrote: Why does hits.getTotal() ignore hitsPerSite? hits.getTotal() always returns the total number of hits, regardless of site. hitsPerSite is a filter on hits as they are displayed. This is the way Google and Yahoo handle this too. Search for NutchAnalysis there. If you look
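The distinction drawn in this reply, a total that counts every hit versus a per-site limit applied only to what is displayed, can be sketched as follows. This is a minimal illustration and not the Nutch `Hits` or `NutchBean` API: the class `SiteFilterSketch`, the method `limitPerSite`, and the representation of each hit by its site name are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Nutch Hits or NutchBean API.
final class SiteFilterSketch {

    // Keeps at most hitsPerSite hits for each site, preserving rank order.
    // Treating a value of 0 or less as "no limit" is an assumption of this sketch.
    static List<String> limitPerSite(List<String> rankedSites, int hitsPerSite) {
        if (hitsPerSite <= 0) {
            return new ArrayList<>(rankedSites);
        }
        Map<String, Integer> seenPerSite = new HashMap<>();
        List<String> shown = new ArrayList<>();
        for (String site : rankedSites) {
            // merge() returns the updated count for this site.
            int count = seenPerSite.merge(site, 1, Integer::sum);
            if (count <= hitsPerSite) {
                shown.add(site);
            }
        }
        return shown;
    }
}
```

In this sketch the total is simply the size of the unfiltered list, so it stays the same whatever value of `hitsPerSite` is used for display.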

Re: 0.7-dev, the search scoring

2005-07-28 Thread Doug Cutting
Fredrik Andersson wrote: I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff has changed I see! One thing I can't quite grasp though, is why the Hit.getScore() has been removed in favour of the TopDocs-thingie? Hit.getScore() was generalized to Hit.getSortValue() in

Re: near-term plan

2005-08-04 Thread Doug Cutting
Stefan Groschupf wrote: http://wiki.apache.org/nutch/Presentations Can you explain what this means: Page 20: - scheduling is bottleneck, not disk, network or CPU? I mean that none of the CPUs, disks or network are at 100% of capacity. Disks are running around 50% busy, CPUs a bit higher, and

Re: near-term plan

2005-08-04 Thread Doug Cutting
Jay Pound wrote: Doug, I also ran into this when I was testing ndfs: the system would have to wait for the namenode to tell the datanodes what data to receive and which data to replicate. When did you test this? Which version of Nutch? How many nodes? My benchmark results from just a few days

Re: JIRA access

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be able to resolve issues, etc. Doug

Re: Nutch website deployment

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: So I have installed forrest and modified src/site/src/documentation/content/xdocs. Then ran 'forrest'. And it generated content in src/site/build/site. And now the questions: Should I copy src/site/build/site to site and commit it? Yes. I'm impressed that you got

Re: ndfs problem needs fix

2005-08-08 Thread Doug Cutting
Jay Pound wrote: 1.) we need to split up chunks of data into sub-folders as not to run the filesystem out of its physical limitations of concurrent files in a single directory, like the way squid splits up its data into directories. I agree. I am currently using reiser with NDFS so this is

Re: User agent string

2005-08-08 Thread Doug Cutting
+1 Piotr Kosiorowski wrote: Hello, We should probably change user agent string in nutch-default.xml to point to Apache site. The only question is http.agent.version - should we set it to 0.07 for release and 0.08-dev for future work? I do not know how it was used previously. Current

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-08 Thread Doug Cutting
[EMAIL PROTECTED] wrote: - <value>http://www.nutch.org/docs/en/bot.html</value> + <value>http://lucene.apache.org/nutch/bot.html</value> I think this should now be: http://lucene.apache.org/nutch/bot.html The docs/en pages have mostly been reduced to the about page, whose translations I hate to

Re: Writable vs Externalizable

2005-08-08 Thread Doug Cutting
Stefan Groschupf wrote: can someone please tell me what is the technical difference between org.apache.nutch.io.Writable and java.io.Externalizable? For me that looks very similar and Externalizable is available since jdk 1.1. What do I miss? You don't miss much! I avoided using Java's
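For readers comparing the two interfaces, the following sketch shows the general shape of a Writable-style type: each class writes and reads its own fields explicitly against `DataOutput` and `DataInput`. The interface `SimpleWritable` is a stand-in defined here for illustration, not `org.apache.nutch.io.Writable` itself; the `readFields` name, the `UrlScore` class, and the byte-array helpers are assumptions of the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in interface for illustration; not the actual Nutch Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type that serializes its two fields by hand.
final class UrlScore implements SimpleWritable {
    String url = "";
    float score;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read back in exactly the order they were written.
        url = in.readUTF();
        score = in.readFloat();
    }

    // Serializes any SimpleWritable into a byte array.
    static byte[] toBytes(SimpleWritable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            value.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Fills an existing instance from bytes produced by toBytes().
    static void fromBytes(SimpleWritable target, byte[] bytes) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip through `toBytes` and `fromBytes` reuses an existing instance rather than allocating a new one, which is one design choice such an interface allows.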

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Doug Cutting
Piotr Kosiorowski wrote: I read your email ten times and still I am not sure what the problem is. The problem is with me. Doug Cutting wrote: [EMAIL PROTECTED] wrote: - <value>http://www.nutch.org/docs/en/bot.html</value> + <value>http://lucene.apache.org/nutch/bot.html</value> I clicked

Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Doug Cutting
Piotr Kosiorowski wrote: Will do it tomorrow - I wanted to put down a kind of release checklist in Wiki - starting with where to change numbers. But would like to cover also release howto - but in fact I am not sure how to make a release yet. But will try to gather this information. A

Re: Nutch versions - Was: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-10 Thread Doug Cutting
Piotr Kosiorowski wrote: I think we all refer to 0.7 as next number (and 0.6 as current) so nutch-default.xml contains wrong format. In fact it should still contain -dev suffix. To make undocumented convention documented I would also like to suggest naming releases with X.Y format and naming

Re: mapred

2005-08-15 Thread Doug Cutting
Jay Pound wrote: is the org.apache.nutch.crawl package a part of the nightly builds? No. Nightly builds are from trunk. The mapred code is in a separate branch in subversion. After the 0.7 release, when the mapred branch is folded into trunk, then it will be in nightly builds. Until then

Re: MapRed - Injector - urlDir - Format?

2005-08-15 Thread Doug Cutting
Fuad Efendi wrote: Which parameter should I pass to Crawl? It should be a directory containing something, in which format? As before, inject takes a flat text file of urls, one per line. If you wish to inject DMOZ urls, there is now a utility main() that will convert the DMOZ file to such a file.
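The input format described here, plain text with one URL per line, can be sketched as a small reader. This is only an illustration of the file format and not the Nutch Injector; the class `SeedListSketch`, the method `parse`, and the choice to trim whitespace and skip blank lines are assumptions of the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating a "one URL per line" seed file.
final class SeedListSketch {

    // Returns the URLs found in the text, one per non-blank line.
    // Trimming and skipping blank lines are assumptions of this sketch.
    static List<String> parse(String text) {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String url = line.trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }
}
```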

Re: Release 0.7

2005-08-16 Thread Doug Cutting
Piotr Kosiorowski wrote: Is anyone working on preparing the release? I am not. If not I can spend some time on it in an hour or so. +1 Thanks, Doug

Re: Slow Results

2005-08-16 Thread Doug Cutting
What API are you using to get hits, NutchBean or OpenSearchServlet? If you're using OpenSearchServlet, then, with 1000 hits, most of your time is probably spent constructing summaries. Do you need the summaries? If not, use NutchBean instead, or modify OpenSearchServlet to not generate

Re: Release 0.7 problem

2005-08-16 Thread Doug Cutting
Piotr Kosiorowski wrote: After making a tar I was trying to go through crawl tutorial. - tar xvfz nutch-0.7.tar.gz bin/nutch - is not executable (and nutch-daemon.sh too). It is strange nobody reported it so far so it may still be my fault. No, it looks like a problem with ant's tar task,

Re: svn.apache.org down?

2005-08-19 Thread Doug Cutting
Jérôme Charron wrote: svn.apache.org http://svn.apache.org down, or the problem is on my side? A good way to answer this is to look at: http://monitoring.apache.org/status/ It looks like SVN is currently up. And it works for me too. Doug

Re: [mapred] Possible bug, static primatives holding config values?

2005-08-30 Thread Doug Cutting
Jeremy Bensley (sent by Nabble.com) wrote: I have been experimenting with MapReduce to perform some distributed tasks aside from the normal fetch/index routine of Nutch, and overall have had much success. I'm glad to hear this! Today I have been experimenting with running extended duration

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Doug Cutting
Apache Wiki wrote: 1. The SVN repository consists of the following areas: a. '''trunk''' [ ... ] a. '''Release-x.x''' branches [ ... ] This should also mention tags, fixed versions of the code where no development occurs. I also would prefer that tag names and branch names are distinct,

Re: Automating workflow using ndfs

2005-08-31 Thread Doug Cutting
I assume that in most NDFS-based configurations the production search system will not run out of NDFS. Rather, indexes will be created offline for a deployment (i.e., merging things to create an index per search node), then copied out of NDFS to the local filesystem on a production search

Re: merge mapred to trunk

2005-08-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If merging mapred to trunk means losing Kelvin's changes, then I

Re: Event queues vs threads

2005-09-01 Thread Doug Cutting
Kelvin Tan wrote: Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance

Re: merge mapred to trunk

2005-09-15 Thread Doug Cutting
I will postpone the merge of the mapred branch into trunk until I have a chance to (a) add some MapReduce documentation; and (b) implement MapReduce-based dedup. Doug Doug Cutting wrote: Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances

Re: Whole-web crawling with the mapreduce branch

2005-09-15 Thread Doug Cutting
For now, look at the source for crawl/Crawl.java. I'll try to add some documentation ASAP. Doug Steffen Viken Valvåg wrote: Hi, I'm playing around with the mapreduce branch, and got it working for a simple intranet crawl by following the nutch tutorial on

Re: use nutch file system independence ...

2005-09-18 Thread Doug Cutting
NDFS is not recommended in 0.7. The version of NDFS in the mapred branch is much improved. Note however that the mapred branch is substantially different than 0.7 and is still incomplete. Doug Transbuerg Tian wrote: hi, all friends, I downloaded nutch0.7, and want to use ndfs independently.

Re: mapred patch for improved error message and some javadoc comments

2005-09-19 Thread Doug Cutting
Paul Baclace wrote: Here is a patch for improving the error message that is displayed when an intranet crawl commandline has a file instead of a directory of files containing URLs. I have committed this to the mapred branch. Thanks, Paul! Doug

Re: question re: usage of createTempFile() for NDFS

2005-09-20 Thread Doug Cutting
Ordway, Ryan wrote: As a quick workaround, I made a few quick adjustments to the NDFSClient.java code to change the directory that temporary files are created in. This is hard coded to /nutch/tmp, but if someone could perhaps add a config option to make it configurable that would be most

Re: why task tracker ports random?

2005-09-26 Thread Doug Cutting
Stefan Groschupf wrote: Besides that, a behavior like the datanode's, which iterates until it finds a free port, would be better than just random. That would be fine. Would a patch have a chance to be applied? I can create one, but I wouldn't love to waste time in case people do not want to

Re: Random number generators for NDFS block numbers

2005-09-26 Thread Doug Cutting
Paul Baclace wrote: Doug Cutting expressed a concern to me about using util.Random to generate random 64 bit block numbers for NDFS. The following is my analysis. Nice stuff, Paul. Thanks. It just occurred to me that perhaps we could simply use sequential block numbering. All block ids

Re: what contibute to fetch slowing down

2005-10-03 Thread Doug Cutting
Fuad Efendi wrote: I found this in J2SE API for setReuseAddress (default: false): When a TCP connection is closed the connection may remain in a timeout state for a period of time after the connection is closed (typically known as the TIME_WAIT state or 2MSL wait state). For applications

Re: tasks is not killed

2005-10-03 Thread Doug Cutting
Stefan Groschupf wrote: I notice that it can happen that a task is still running when the job already was killed. The web gui says there is no running job and processes keep the nodes busy. I haven't found the source of the problem yet. I have seen this too. I think the solution is that, when

Re: Nutch 0.7.1 and Nutch web site

2005-10-03 Thread Doug Cutting
Piotr Kosiorowski wrote: Should we have version independent site - always modified in trunk? Or should we think about having a site (eg. JavaDocs, tutorial etc) versioned and available for all versions at the same time? The practice I've followed is to have the website reflect the latest

Re: reprocessing hanging tasks

2005-10-10 Thread Doug Cutting
Stefan Groschupf wrote: Maybe we misunderstand each other, I do not mean tasks that crash, I mean tasks that are 20 times slower on one machine than the other tasks on the other machines. Ah, I call that speculative re-execution. Nutch does not yet implement that. I don't think speculative

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Doug Cutting
Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings and expressions) and it should be very fast to

Re: nutch downloads

2005-10-12 Thread Doug Cutting
Erik Hatcher wrote: Please - someone reply back volunteering to correct this ASAP. My bad. I'm fixing this right now. In 24 hours all Nutch downloads should be through the mirrors. Sorry! Doug

Re: nutch downloads

2005-10-13 Thread Doug Cutting
Okay. All nutch downloads should now be through mirrors. The web site now refers to downloads through the url: http://www.apache.org/dyn/closer.cgi/lucene/nutch/ The former download urls now redirect to the appropriate places: http://lucene.apache.org/lucene/nutch/release/

Re: developing a parse-/index-/query- plugin set

2005-10-17 Thread Doug Cutting
Chris Mattmann wrote: So, one thing it seems is that fields to be indexed, and used in a field query must be fully lowercase to work? Additionally, it seems that they can't have symbols in them, such as _, is that correct? Would you guys consider this to be a bug? Yes, this sounds like a bug.

Re: developing a parse-/index-/query- plugin set

2005-10-17 Thread Doug Cutting
Chris Mattmann wrote: So, my question to you then is, what type of QueryFilter should I develop in order to get my query for contactemail:email address to work as a standalone query? For instance, right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be the right way to do it

Re: [Nutch-dev] [Fwd: Fetch list priority]

2005-10-19 Thread Doug Cutting
Massimo Miccoli wrote: Any news about integration of OPIC in mapred? I have time to develop OPIC on Nutch Mapred. Can you help me to start? By the email from Carlos Alberto-Alejandro CASTILLO-Ocaranza, seems that the best way to integrate OPIC in on old webdb, is this way valid also CrawlDb

Re: OPIC

2005-10-19 Thread Doug Cutting
Here is a patch that implements this. I'm still testing it. If it appears to work well, I will commit it. Doug Cutting wrote: Massimo Miccoli wrote: Any news about integration of OPIC in mapred? I have time to develop OPIC on Nutch Mapred. Can you help me to start? By the email from

rel=nofollow

2005-10-20 Thread Doug Cutting
The attached patch adds support for rel=nofollow. Links which specify this are ignored. Any objections to committing this? http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html Doug Index: src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java

Re: status dedub

2005-10-25 Thread Doug Cutting
Stefan Groschupf wrote: I copy a working index and merge the original and the old together. Then I run the dedup over these indexes. Shouldn't the dedup tool remove the duplicates in the merged index? I usually dedup before index merge, so that the merged index contains no duplicates. The

Re: deltas to wiki page nutch/NutchDistributedFileSystem

2005-10-31 Thread Doug Cutting
Paul Baclace wrote: I hope someone can fold these into the wiki page since it appears as Immutable Page to me. You just need to create yourself an account by visiting: http://wiki.apache.org/nutch/UserPreferences Doug

Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting
Rod Taylor wrote: Every segment that I fetch seems to be missing a part when stored on the filesystem. The stranger thing is it is always the same part (very reproducible). This sounds strange. Are the datanode errors always on the same host? How many hosts are you running this on? Doug

Re: mapred questions

2005-11-04 Thread Doug Cutting
Ken van Mulder wrote: First is that the fetcher slows down over time and continues to use more and more memory as it goes (which I think is eventually hanging the process). What parser plugins do you have enabled? These are usually the culprit. Try using 'kill -QUIT' to see what various

Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting
Rod Taylor wrote: There is only a single datanode and there are 20 hosts. That's a lot of load on one datanode. I typically run a datanode on every host, accessing the local drives on that host. Doug

Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting
Rod Taylor wrote: I tried running one datanode per machine connecting back to the same SAN but it seemed pretty clunky. A crash of any datanode would take down the entire system (no data replication since it's a common data-store in the end). Reducing it to a single datanode did not have this

Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting
Rod Taylor wrote: Here you go. local filesystem and a single job tracker on another machine. When the tasktracker and jobtracker are on the same box there isn't a problem. When they are on different machines it runs into issues. This is using mapred.local.dir on the local machine (not sharedd

Re: mapred bug -- bad part calculation?

2005-11-08 Thread Doug Cutting
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the No input directories issue when using a local filesystem with multiple task

protocol-http versus protocol-httpclient

2005-11-09 Thread Doug Cutting
I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is

Re: Lucene or Nutch

2005-11-09 Thread Doug Cutting
Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it in the lucene contrib

Re: Index update and Google Dance

2005-11-09 Thread Doug Cutting
Jack Tang wrote: Below is google architecture in my brain: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all the time. One day, it gets fetchlist from DataNode A, crawls all pages and

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Doug Cutting
Massimo Miccoli wrote: There's a problem with that solution. The protocol-httpclient now, for some sites, generates a SEVERE Narrowly avoided an infinite loop in execute So the fetcher exits and only some pages are fetched until the SEVERE message. I don't know a solution, for now I switch back

Re: Max Per Host and topN

2005-11-10 Thread Doug Cutting
Rod Taylor wrote: It seems maxPerHost could cause us not to fill each segment to topN even when there are more than enough URLs for this job. We should only count URLs we keep instead of all URLs considered. There were also two variables named count which is probably bad form (not a Java

Re: Lucene or Nutch

2005-11-10 Thread Doug Cutting
Andrzej Bialecki wrote: I would be disappointed by this move - language identifier is an important component in Nutch. Now the mere fact that it's bundled with Nutch encourages its proper maintenance. If there is enough drive in terms of willingness and long-term commitment it would make sense

Re: [Nutch Wiki] Update of OverviewDeploymentConfigs by PaulBaclace

2005-11-11 Thread Doug Cutting
Great stuff, Paul! A few minor corrections. Apache Wiki wrote: 1. The env var NUTCH_MASTER is set to the hostname of the master machine. This is optional. The alternative is to mount a common home directory with NFS, as many clusters do, and keep the Nutch software there. Also,

Re: threading versus nio

2005-11-14 Thread Doug Cutting
Johannes Zillmann wrote: please correct me if i'm wrong, but if i understood all right there are 2 choices... (1) message based communication (2) stream based communication In case of (2) you won't come along without one thread per connection. In general, you are correct. But Nutch's IPC is

Re: problem with inject url on mapred

2005-11-16 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Yes, problem in negative progress percentages. Is /usr/root/seeds/urls the same file on all hosts? How big is it? Doug

Re: Question of Range search

2005-11-16 Thread Doug Cutting
Game Now wrote: Hi All, I want Nutch to help me do a range search, such as price:{1000 TO 2000} or date[20050101 TO 2005]. But the org.apache.nutch.searcher.Query#parse() method parses them to price 1000 2000 and date 20050101 2005 when I pass them to the method. Anybody can help me complete

Re: Problem with CRC files on NDFS

2005-11-21 Thread Doug Cutting
Andrzej Bialecki wrote: I have a problem with the recently added CRC files, when putting stuff to NDFS. NDFS complains that these files already exist - I suspect that it creates them on the fly just before they are actually transmitted from the NDFSClient - and aborts the transfer. I was able

Re: mapred.map.tasks

2005-11-21 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Why do we need the parameter mapred.map.tasks to be greater than the number of available hosts? If we set it equal to the number of hosts, we get the negative progress percentages problem. Can you please post a simple example that demonstrates the negative progress problem? E.g., the minimal

Re: Urlfilter bug (doesn't return on long URLs)

2005-11-21 Thread Doug Cutting
This sounds like a bug in the URLFilter implementation. Is this RegexURLFilter? Can you figure out what regex is causing this? Probably the patch should be there, no? Doug Rod Taylor wrote: I stuck a few log statements within ParseOutputFormat.java. One after 'String toUrl =' and another

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Doug Cutting
Andrzej Bialecki wrote: Further input into this: after replacing the ConjunctionScorer with the fixed version from JIRA, now the bottleneck seems to be ... in Summarizer, of all things. :-) While making the summarizer faster would of course be good, keep in mind that the cost of summarizing

[Fwd: Spider Causing Contact Form Submissions]

2005-11-22 Thread Doug Cutting
It looks as though Nutch is inadvertently submitting forms. At DOMContentUtils.java:58 we specify that the action parameter of an HTML form should be extracted as a link. Yet we ignore the method parameter of the form. I think we should only follow these when the method is get, not when it
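The rule proposed in this message, treating a form's action as a link only when the form's method is get, can be sketched as a small predicate. This is not the code in `DOMContentUtils.java`; the class `FormLinkPolicy` and the method `shouldFollowAction` are hypothetical names, and treating a missing method attribute as GET follows the HTML default rather than anything stated in the message.

```java
// Hypothetical policy check; not the actual DOMContentUtils implementation.
final class FormLinkPolicy {

    // Returns true when a form's action URL may be extracted as an ordinary link.
    // A missing or empty method attribute is treated as GET, the HTML default.
    static boolean shouldFollowAction(String method) {
        if (method == null || method.trim().isEmpty()) {
            return true;
        }
        return method.trim().equalsIgnoreCase("get");
    }
}
```

With such a check, a form declared with `method="post"` would contribute no outlink, so the crawler would not trigger its submission.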

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!). The reader offers similar functionality to the classic readdb command. This looks great! Thanks, Andrzej. I just ran it on a 50M page crawl. It took longer than I expected. The reduce

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
Doug Cutting wrote: I just ran it on a 50M page crawl. FYI, here's the output:
051123 191703 TOTAL urls: 167780785
051123 191703 avg score:1.152
051123 191703 max score:47357.137
051123 191703 min score:1.0
051123 191703 retry 0: 167780785
051123 191703 status 1

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Ken Krugler wrote: For what it's worth, below is the filter list we're using for doing an html-centric crawl (no word docs, for example). Using the (?i) means we don't need to have upper and lower-case versions of the suffixes.
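The effect of the `(?i)` flag mentioned here can be shown with `java.util.regex.Pattern`. The suffix list below is invented for the example, since the filter list from the original message is not reproduced in this excerpt; the class `SuffixFilterSketch` and the method `accept` are likewise hypothetical and are not the Nutch URL filter plugin.

```java
import java.util.regex.Pattern;

// Hypothetical suffix filter; the suffix list is an assumption of this sketch.
final class SuffixFilterSketch {

    // (?i) turns on case-insensitive matching for the whole expression,
    // so ".gif", ".GIF" and ".Gif" are all covered by a single entry.
    private static final Pattern SKIPPED_SUFFIXES =
            Pattern.compile("(?i)\\.(gif|jpg|png|zip)$");

    // Returns true when the URL does not end in one of the skipped suffixes.
    static boolean accept(String url) {
        return !SKIPPED_SUFFIXES.matcher(url).find();
    }
}
```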

Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Doug Cutting
Matt Kangas wrote: #2 should be a pluggable/hookable parameter. high-scoring sounds like a reasonable default basis for choosing recrawl intervals, but I'm sure that nearly everyone will think of a way to improve upon that for their particular system. e.g. high-scoring ain't gonna cut it

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Jérôme Charron wrote: For consistency, and ease of nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in order to try to keep us

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Matt Kangas wrote: The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... This could be a

Re: incremental crawling

2005-12-02 Thread Doug Cutting
Andrzej Bialecki wrote: Doug Cutting wrote: Modify CrawlDatum to store the MD5Hash of the content of fetched urls. Yes, this is required to detect unmodified content. A small note: plain MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages with a counter, or with ads
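The idea under discussion, storing a digest of fetched content and comparing it on the next fetch, can be sketched with the JDK's `MessageDigest`. This is not Nutch's `MD5Hash` or `CrawlDatum` code; `ContentSignatureSketch`, `md5`, and `unchanged` are hypothetical names. The sketch also illustrates the caveat raised in the quoted note: any byte-level change, such as a visit counter, yields a different digest.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper; not the Nutch MD5Hash or CrawlDatum classes.
final class ContentSignatureSketch {

    // Computes the 16-byte MD5 digest of the raw page content.
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true when newly fetched content has the same digest as before.
    static boolean unchanged(byte[] previousDigest, byte[] newContent) {
        return Arrays.equals(previousDigest, md5(newContent));
    }
}
```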

Re: RCP known limitation or bug?

2005-12-07 Thread Doug Cutting
This should work. TestRPC.java has a case which returns void (ping). Can you send a simple test case that fails? Doug Stefan Groschupf wrote: Hi, I never used the RPC that intensively so I was surprised to find this limitation. Is it known that the RPC.call method can only call methods that

Re: Lucene performance bottlenecks

2005-12-08 Thread Doug Cutting
Doug Cutting wrote: Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers. In particular, one could: 1. Create an array of int
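The renumbering described here, where highly boosted documents receive low document numbers, can be sketched as a mapping from old to new document numbers. The listed steps are cut off in this excerpt, so the following is one plausible reading and not the procedure from the original message or any Lucene API; `BoostRenumberSketch` and `oldToNew` are hypothetical names, and taking the boosts as a plain `float[]` is an assumption.

```java
import java.util.Arrays;

// Hypothetical sketch of renumbering documents by descending boost.
final class BoostRenumberSketch {

    // Returns a map where result[oldDoc] is the new number of that document.
    // Documents with higher boosts receive lower new numbers.
    static int[] oldToNew(float[] boosts) {
        Integer[] order = new Integer[boosts.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Object sorts are stable, so documents with equal boosts
        // keep their original relative order.
        Arrays.sort(order, (a, b) -> Float.compare(boosts[b], boosts[a]));

        int[] mapping = new int[boosts.length];
        for (int newDoc = 0; newDoc < order.length; newDoc++) {
            mapping[order[newDoc]] = newDoc;
        }
        return mapping;
    }
}
```

With documents numbered this way, a search that stops after the first matches in document order has already seen the highest-boosted candidates, which is the motivation given in the thread.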

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Okay, I'll try to get something working fairly soon. Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Attached is a class which sorts a Nutch index by boost. I have only tested it on a ~100 page index, where it appears to work correctly. Please tell me how it works for you.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. We will also need to estimate the total number of matches by extrapolating

[Fwd: Crawler submits forms?]

2005-12-13 Thread Doug Cutting
FYI This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. Doug Original Message Subject: Crawler submits forms? Date: Tue, 13 Dec 2005 16:57:34 - From: Andy Read [EMAIL PROTECTED] Reply-To:

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting
Andrzej Bialecki wrote: I'll test it soon - one comment, though. Currently you use a subclass of RuntimeException to stop the collecting. I think we should come up with a better mechanism - throwing exceptions is too costly. I thought about this, but I could not see a simple way to achieve

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Doug Cutting
Stefan Groschupf wrote: - job.setPartitionerClass(PartitionUrlByHost.class); in the generate method yes, this line is the one you need to change. The other stuff can be as it is for now. I don't recommend this change. It makes your crawler impolite, since multiple tasks may reference

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case. . When results differed

Re: mapreduce fetcher doesn't fetch all urls

2005-12-15 Thread Doug Cutting
Stefan Groschupf wrote: In case you set up one thread per host, you have at most as many connections to one host as you have boxes. In my case that is not that many. Anything more than one is not generally considered polite. Also it is a reproducible bug that the segment is every time

Re: Nutch design queries

2005-12-15 Thread Doug Cutting
Mike Cannon-Brookes wrote: Hey guys, Hi, Mike! Welcome. - Classloading - I have had many problems with NutchConf due to the way it loads its resources. In a J2EE scenario, it's simply evil :) Would there be any great problem with switching its classloader to

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: Please also don't forget that the trunk/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) Thinking about this more, perhaps we should do it sooner. There's already a branch for 0.7.x releases, so what point is there

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: I agree. I just thought that we would prepare the release based on the code in trunk/ , and in that case we would like to wait with the merge before we do the release. My definition of trunk is that it should be where the majority of development happens. It is what we

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: Yes, we just need to make sure that all important bits from trunk are on the 0.7 branch, before we start. I will sync mapred with the trunk prior to the merge, so we should still be able to get anything we need after mapred is merged back to trunk. BTW, we're pretty

Re: Nutch design queries

2005-12-15 Thread Doug Cutting
Mike Cannon-Brookes wrote: 0.7 vs 0.8 - apologies if I'm using an old version. I'm using the latest binary release. I'll switch to latest SVN HEAD and see how that works in my application. The mapred branch will soon be moved to trunk, so you might be better off starting there, since a lot

Re: Nutch design queries

2005-12-15 Thread Doug Cutting
Doug Cutting wrote: Once the mapred branch is folded in then there's a bunch of stuff that's obsoleted that needs to be removed. I'd like to get dynamic configuration in, if possible. For reference, I found the message I posted about this a while back: http://www.mail-archive.com/nutch-dev

mapred merge to trunk

2005-12-15 Thread Doug Cutting
Sami Siren wrote: +1. I think this is good time to merge now as the mapred is fully usable. Barring objections, I will do this tomorrow morning, Pacific time. Doug

Re: svn commit: r357334 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/protocol/Content.java src/java/org/apache/nutch/protocol/ContentProperties.java

2005-12-17 Thread Doug Cutting
[EMAIL PROTECTED] wrote: +/* + * (non-Javadoc) + * + * @see org.apache.nutch.io.Writable#write(java.io.DataOutput) + */ +public final void write(DataOutput out) throws IOException { We should either include javadoc or not. In general, all public methods should have

no nightly builds until 27 December

2005-12-18 Thread Doug Cutting
I am leaving tomorrow for a one week vacation and will turn off my home workstation, so there will be no nightly builds. Long-term, I've submitted an infrastructure request to get a Solaris zone created for Nutch where we can run nightly builds. That will eventually remove the dependency on

Re: Bug in DeleteDuplicates.java ?

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: Gal Nitzan wrote: this function throws IOException. Why? public long getPos() throws IOException { return (doc*INDEX_LENGTH)/maxDoc; } It should be throwing ArithmeticException The IOException is required by the API of RecordReader.

Re: [bug?] PRC called emthod require parameter

2006-01-02 Thread Doug Cutting
Stefan Groschupf wrote: I also note this line in client.java public Writable[] call(Writable[] params, InetSocketAddress[] addresses) throws IOException { if (params.length == 0) return new Writable[0]; Do I understand it correctly that in case the remote method does not need any

Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting
Andrzej Bialecki wrote: I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually

Re: IndexSorter optimizer

2006-01-04 Thread Doug Cutting
Byron Miller wrote: On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway) Both. The highest-scoring pages are kept in separate indexes that are searched

Re: no static NutchConf

2006-01-04 Thread Doug Cutting
Andrzej Bialecki wrote: Example: what happens now if you try to run more than one fetcher at the same time, where the fetcher parameters differ (or a set of activated plugins differs)? You can't - the local tasks on each tasktracker will use whatever local config is there. That's true when

Re: Per-page crawling policy

2006-01-05 Thread Doug Cutting
Stefan Groschupf wrote: Before we start adding meta data and more meta data, why not once in general add meta data to the crawlDatum, then we can have any kind of plugins that add and process metadata that belongs to a url. +1 This feature strikes me as something that might prove very

Re: [VOTE] Commiter access for Stefan Groschupf

2006-01-05 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I'm late, but better late than never: +1 (I thought Stefan was already a committer, actually). +1 Not as late as I am! I'm still catching up on December email... The Lucene PMC has final say, and not all members of the PMC are on nutch-dev, so I'll forward the
