Re: nighly build brocken?

2006-04-11 Thread Byron Miller
i get nightly to run, but it never completes anything. always get stuck at 98% here and there.. i'll try today's build and see what happens. --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, looks like the latest nightly build is broken. Looks like the jar that comes with the nightly build

Re: scalability limits getDetails, mapFile Readers?

2006-03-02 Thread Byron Miller
I would like to see something as active, in process and inbound. Active data is live and on the query servers (both indexes and correlating segments) in process are tasks currently being mapped out and inbound is processes/data that is pending to be processed. Active nodes report as in the
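The three states proposed in this message could be modelled as a small enumeration. The following is only a minimal sketch, assuming hypothetical names (`NodeState`, `summarize`) that are not part of Nutch; it simply counts how many items sit in each state.

```python
from collections import Counter
from enum import Enum


class NodeState(Enum):
    """Hypothetical status labels, taken from the wording of the message."""
    ACTIVE = "active"          # live on the query servers (indexes and segments)
    IN_PROCESS = "in process"  # tasks currently being mapped out
    INBOUND = "inbound"        # data pending processing


def summarize(states):
    """Count items per state, reporting zero for states with no items."""
    counts = Counter(states)
    return {state.value: counts.get(state, 0) for state in NodeState}
```

A status page could call `summarize` on whatever list of states the nodes report and display the resulting dictionary.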

Re: Carrot2 v. 1.0.1. [clustering plugin]

2006-02-03 Thread Byron Miller
I would love to see it continue as a plugin. I'm moving to mapreduce myself so i would be interested in utilizing it there. thanks for the great work! look forward to trying out your updates. feel free to contact me directly if you wish. -byron --- Dawid Weiss [EMAIL PROTECTED] wrote: Hi

indexSorter - applied to SVN or patch in Jira?

2006-01-31 Thread Byron Miller
Has indexsorter code discussed a while back been pushed to jira or put in SVN? I'd like to give it a whirl on some of my indexes and the archive i can find cut the post with the code attached..

[jira] Commented: (NUTCH-16) boost documents matching a url pattern

2006-01-28 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-16?page=comments#action_12364354 ] byron miller commented on NUTCH-16: --- Cool, an inverse of this plugin would be great, or an enhancement of this for +/- values based on patterns as i think lowering score

[jira] Commented: (NUTCH-79) Fault tolerant searching.

2006-01-28 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364357 ] byron miller commented on NUTCH-79: --- Piotr, Any update on this? Have you been able to run with this or still working out the kinks? Fault tolerant searching

[jira] Commented: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-01-28 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-14?page=comments#action_12364358 ] byron miller commented on NUTCH-14: --- Are you still hitting this Stefan? NullPointerException NutchBean.getSummary - Key

Re: need volunteer to develop search for apache.org

2006-01-25 Thread Byron Miller
I'll be happy to do it. --- Doug Cutting [EMAIL PROTECTED] wrote: Would someone volunteer to develop Nutch-based site-search engine for all apache.org domains? We now have a Solaris zone to host this. Thanks, Doug

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-01-20 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12363400 ] byron miller commented on NUTCH-134: Thanks Erik, I was able to pull down the highlighter and i'll be loading it up on mozdex.com to test out over the weekend (1/21/2006

[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

2006-01-20 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363477 ] byron miller commented on NUTCH-183: As Mr Burns would say eggcelent I'll give this a try. BTW, is it possible to implement functionality that would start jobs

Re: Problem with latest SVN during reduce phase

2006-01-13 Thread Byron Miller
because one document triggers such an exception. best regards, Dominik Byron Miller wrote: 60111 103432 reduce reduce 060111 103432 Optimizing index. 060111 103433 closing reduce 060111 103434 closing reduce 060111 103435

RE: MapReduce and segment merging

2006-01-12 Thread Byron Miller
I was thinking that Nutch needs some sort of workflow manager. This way you could build jobs off specific workflows and hopefully recover jobs based upon the portion of the workflow they are stuck in. (or restart a job if failed/processing time x hours and other such workflow process rules)
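The recovery rules suggested here can be sketched as a small decision function. This is an illustration only: `JobStatus`, `next_action`, and the `max_hours` threshold are hypothetical names, and reading "processing time x hours" as "running longer than a limit" is an assumption, not something the message states.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    """Hypothetical record of where a job is in its workflow."""
    name: str
    stage: str            # portion of the workflow the job last reached
    failed: bool
    hours_running: float


def next_action(job, max_hours):
    """Apply the two rules: recover failed jobs, restart overlong ones."""
    if job.failed:
        # Assumed rule: pick the job up at the stage where it got stuck.
        return f"recover {job.name} at stage {job.stage}"
    if job.hours_running > max_hours:
        # Assumed rule: a job past the time limit is restarted from scratch.
        return f"restart {job.name}"
    return "wait"
```

Keeping the rules in one pure function makes it easy to add further workflow rules without touching the code that runs the jobs.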

Problem with latest SVN during reduce phase

2006-01-11 Thread Byron Miller
60111 103432 reduce reduce 060111 103432 Optimizing index. 060111 103433 closing reduce 060111 103434 closing reduce 060111 103435 closing reduce java.lang.NullPointerException: value cannot be null at org.apache.lucene.document.Field.init(Field.java:469) at

Re: Per-page crawling policy

2006-01-05 Thread Byron Miller
Excellent ideas, and that is what i'm hoping to do: use some of the social bookmarking type ideas to build the starter sites and linkmaps from. I hope to work with Simpy or other bookmarking projects to build somewhat of a popularity map (human edited authority) to merge and calculate against a

Re: mapred crawling exception - Job failed!

2006-01-04 Thread Byron Miller
Fixed in the copy i run as i've been able to get my 100k pages indexed without getting that error. -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Lukas Vlcek wrote: Hi, I am trying to use the latest nutch-trunk version but I am facing unexpected Job failed! exception. It seems

Re: IndexSorter optimizer

2006-01-03 Thread Byron Miller
On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway) With this patch and a top result set in the xml file does that mean it will stop scanning the index at

Adding some theory publication links into the Wiki..

2006-01-03 Thread Byron Miller
I figured since i'm in research mode i would start compiling available information resources and putting them up on the wiki http://wiki.apache.org/nutch/Search_Theory sorry about all the cvs messages on edits.. i'm not used to the touchpad on this darned laptop :) Anyhow, if you have any

[jira] Created: (NUTCH-159) Specify temp/working directory for crawl

2005-12-31 Thread byron miller (JIRA)
Reporter: byron miller I ran a crawl of 100k web pages and got: org.apache.nutch.fs.FSError: java.io.IOException: No space left on device at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149) at org.apache.nutch.fs.FileUtil.copyContents
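One way to avoid the "No space left on device" failure reported here is to check free space before choosing a working directory. The sketch below is an assumption about how such a check might look, not how Nutch handles it; `pick_working_dir` and its parameters are hypothetical, and only the standard-library call `shutil.disk_usage` is relied on.

```python
import shutil


def pick_working_dir(candidates, required_bytes):
    """Return the first candidate directory with enough free space, else None."""
    for directory in candidates:
        try:
            free = shutil.disk_usage(directory).free
        except OSError:
            # Directory is missing or unreadable: skip it.
            continue
        if free >= required_bytes:
            return directory
    return None
```

For example, `pick_working_dir(["/tmp", "/data/scratch"], 50 * 1024**3)` would return the first of the two (hypothetical) paths with at least 50 GiB free, or `None` if neither qualifies.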

[jira] Commented: (NUTCH-123) Cache.jsp some times generate NullPointerException

2005-12-31 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-123?page=comments#action_12361473 ] byron miller commented on NUTCH-123: Perhaps you should try the cache servlet as it dumps out the data as it sees it. Cache.jsp some times generate NullPointerException

[jira] Commented: (NUTCH-42) enhance search.jsp such that it can also returns XML

2005-12-31 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-42?page=comments#action_12361474 ] byron miller commented on NUTCH-42: --- Safe to close. (done) We have XML/OpenSearch in latest trunk and other branches. enhance search.jsp such that it can also returns XML

[jira] Created: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

2005-12-29 Thread byron miller (JIRA)
Versions: 0.8-dev Reporter: byron miller Priority: Minor Add support to the fetcher to look for sitemap files, download them and process them into webdb. Perhaps create a robots.txt directive that can be used to create a standard format for sitemaps in RSS, XML or text format (one line

[jira] Commented: (NUTCH-155) Remove web gui from the distribution to contrib and use OpenSearch Servlet

2005-12-29 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-155?page=comments#action_12361398 ] byron miller commented on NUTCH-155: I don't know how i feel about removing the JSP stuff into a contrib and then fluffing it up more with the potential to support other

Re: Mega-cleanup in trunk/

2005-12-28 Thread Byron Miller
I'll pull a build down tonight and let you know how it goes! -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I just committed a large patch to cleanup the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as

[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-12-28 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12361348 ] byron miller commented on NUTCH-92: --- Has there been any advancement on this front? DistributedSearch incorrectly scores results

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2005-12-28 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12361350 ] byron miller commented on NUTCH-134: Where is the lucene summarizer from the contrib? i'm not seeing anything obvious (unless it's under a different name) Summarizer

[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

2005-12-27 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12361300 ] byron miller commented on NUTCH-95: --- Number 2 sounds great, but wouldn't you always want the latest scoring document since that should reflect the latest updatedb and rank

[jira] Commented: (NUTCH-55) Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available

2005-12-27 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-55?page=comments#action_12361301 ] byron miller commented on NUTCH-55: --- You can close this ticket, duplicate of ticket NUTCH-59 Create dmoz.org search plugin - incorporate the dmoz.org title/category

failure with crawl using 12/23 trunk

2005-12-23 Thread Byron Miller
Not sure if it's because i have some of the older 7.x parameters for my plugins - did these change in trunk? 051223 194716 crawl-20051223193201/crawldb/current/part-0/data:0+809491 051223 194716 map 100% 051223 194717 crawl-20051223193201/linkdb/current/part-0/data:0+1270873 -adding

Re: IndexSorter optimizer

2005-12-21 Thread Byron Miller
I've got 400mill db i can run this against over the next few days. -byron --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Andrzej, wow are really great news! Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the

Re: [VOTE] Commiter access for Stefan Groschupf

2005-12-16 Thread Byron Miller
+1 Thanks for all the hard work! Very much appreciated --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, During the past year and more Stefan participated actively in the development, and contributed many high-quality patches. He's been spending considerable effort on addressing many

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2005-12-07 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359649 ] byron miller commented on NUTCH-134: I would take more cpu for better summaries any day :) cpu power is cheaper than manual intervention! If any testing is needed, don't

standard version of log4j

2005-11-07 Thread Byron Miller
Is there any way to make sure all plugins/modules reference a standard version of log4j? seems to me there are at least 3 different versions (although minor) # find . | grep log4 ./plugins/parse-pdf/log4j-1.2.9.jar ./plugins/parse-pdf/PDFBox-0.7.2-log4j.jar ./plugins/parse-rss/log4j-1.2.6.jar
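The manual `find . | grep log4` check can be turned into a small script that groups jar paths by log4j version. This is a minimal sketch: `log4j_versions` is a hypothetical helper, and it only recognises files named `log4j-<version>.jar`, so a bundled jar such as `PDFBox-0.7.2-log4j.jar` is deliberately left out.

```python
import re
from collections import defaultdict

# Matches a file name of the form log4j-<version>.jar at the end of a path.
LOG4J_JAR = re.compile(r"(?:^|/)log4j-(\d+(?:\.\d+)*)\.jar$")


def log4j_versions(paths):
    """Group jar paths by the log4j version embedded in their file name."""
    found = defaultdict(list)
    for path in paths:
        match = LOG4J_JAR.search(path)
        if match:
            found[match.group(1)].append(path)
    return dict(found)
```

If the returned dictionary has more than one key, the plugins ship different log4j versions.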

RE: Halloween Joke at Google

2005-11-02 Thread Byron Miller
I wish it did have something to do with halloween :) Google tells no lies! :P --- Nick Lothian [EMAIL PROTECTED] wrote: If you just do the search you'll see a link at the side of the page: Why these results? These results may seem politically slanted. Here's what happened.

RE: Halloween Joke at Google

2005-11-02 Thread Byron Miller
Actually, to add fuel to the fire, using nutch out of the box, searching for miserable failure yields the same thing. http://www.mozdex.com/search.jsp?query=miserablefailure --- Fuad Efendi [EMAIL PROTECTED] wrote: Thanks Nick, So this is why some search engines are not honest. I mean the

Re: Halloween Joke at Google

2005-11-02 Thread Byron Miller
is still much smaller than Google's, it is amazing how closely the results can match! Makes you wonder just how much of the net is useful ;) -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Byron Miller wrote: Actually, to add fuel to the fire, using nutch out of the box, searching

Re: NekoHTML 0.9.5

2005-11-01 Thread Byron Miller
I'll give tagsoup a try, i saw that was in there. thanks for the headsup! -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Byron Miller wrote: http://people.apache.org/~andyc/neko/doc/html/changes.html Any chance of getting that rolled in? Has a few fixes that look good

[jira] Commented: (NUTCH-39) pagination in search result

2005-10-30 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-39?page=comments#action_12356374 ] byron miller commented on NUTCH-39: --- I'm using the above code snippet on mozdex and run across some strange issues.. for example if you search for cnn.com it doesn't show up

[jira] Commented: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-10-25 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_12355864 ] byron miller commented on NUTCH-49: --- Can something like this be adapted to use the regex filter as well? it would be nice to say new only and match urls of x type or x link