Tomcat adds file:/// to searcher.dir path

2010-12-23 Thread alxsss
Hello, I have installed nutch-1.2 on Fedora 14 with tomcat6. I added the path to the crawl dir in the searcher.dir property in WEB-INF/classes/nutch-default.xml as /home/user/nutch-1.2/crawl. I see in the catalina.out file WARN SearchBean - Neither file:///home/user/nutch-1.2/crawl/index nor

failed with: java.net.UnknownHostException

2010-12-27 Thread alxsss
Hello, I use nutch-1.2 with Fedora 14 and try to index about 4000 domains. I use bin/nutch crawl urls -dir crawl -depth 3 -topN -1 and have this in crawl-urlfilter.txt: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)* I noticed that if a domain has been entered like http://mydomain.com in

unnecessary results in search

2011-01-03 Thread alxsss
Hello, I used nutch-1.2 to index a few domains. I noticed that nutch correctly crawled all sub-pages of the domains. By sub-pages I mean the following: for example, for a domain mydomain.com, all links inside it like mydomain.com/show/photos/1, etc. I also noticed in our apache logs that

Re: unnecessary results in search

2011-01-04 Thread alxsss
Hello, Thank you for your response. Let me give you more detail on the issue that I have. First, definitions. Let's say I have my own domain that I host on a dedicated server; call it mydomain.com. Next, call subdomains the following: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com

Re: Exception on segment merging

2011-01-04 Thread alxsss
Which command did you use? Merging segments is very expensive in resources, so I try to avoid merging them. -Original Message- From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com To: user user@nutch.apache.org Sent: Tue, Jan 4, 2011 7:12 am Subject: FW: Exception on segment

Re: unnecessary results in search

2011-01-06 Thread alxsss
One more thing I just noticed is that Nutch search results do not display information from the meta tag. Google and Yahoo do. In more detail, Nutch search results for the keyword mydomain.com display some short text from the page mydomain.com. By contrast, Google and Yahoo search results for the

Re: unnecessary results in search

2011-01-10 Thread alxsss
Hello, I just noticed that Google actually has results from all subpages of mydomain.com for the keyword mydomain.com, but they are hidden behind a link "show more results from mydomain.com". Is there a way of putting more results from the same domain behind such a link in the Nutch rss feed, since I use

Re: Few questions from a newbie

2011-01-26 Thread alxsss
You can set fetch external and internal links to false and increase the depth. -Original Message- From: Churchill Nanje Mambe mambena...@afrovisiongroup.com To: user user@nutch.apache.org Sent: Wed, Jan 26, 2011 8:03 am Subject: Re: Few questions from a newbie even if the url

nutch crawl command takes 98% of cpu

2011-01-27 Thread alxsss
Hello, I run the crawl command with -depth 7 -topN -1 on my linux box with 1.5 Mbps internet, an amd 3.1ghz processor, 4GB memory, Fedora Linux 14, nutch 1.2. After 1-2 days nutch takes 98% of cpu. My seed file includes about 3500 domains and I set fetch external links to false. Is this normal? If

Re: Nutch search result

2011-02-18 Thread alxsss
2nd, after testing to fetch several pages from wikipedia, the search query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache ../wiki_dir returns It returns a result for keyword apache because that url has apache in it. -topN 50), it actually fetches some pages e.g. `fetching

Re: Starting web frontend

2011-02-24 Thread alxsss
Hello, I wondered if there is a way of adding to solrindex made from nutch segments another solrindex also made from nutch segments. I have to index about 3000 domains but 5 of them are newspaper sites. So, I need to crawl-fetch-parse these 5 domains(with depth 2) and update index every

Re: Reload index without restart tomcat.

2011-03-08 Thread alxsss
That tutorial is applicable for the new version too. -Original Message- From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com To: user user@nutch.apache.org; 'McGibbney, Lewis John' lewis.mcgibb...@gcu.ac.uk Sent: Tue, Mar 8, 2011 5:25 am Subject: RE: Reload index without

will nutch-2 be able to index image files

2011-03-08 Thread alxsss
Hello, I wondered if nutch version 2 will be able to index image files? Thanks. Alex.

Re: will nutch-2 be able to index image files

2011-03-08 Thread alxsss
I meant to extract the image title, src link and alt from img tags, not to store the image files. For a keyword search it must display a link, which automatically displays the image itself in the search page. Not sure what you mean by image content-based retrieval? Do image files have tags like mp3 ones?

Re: nutch crawl command takes 98% of cpu

2011-03-14 Thread alxsss
Hello, Which version is this patch applicable to? Thanks. Alex. -Original Message- From: Alexis alexis.detregl...@gmail.com To: user user@nutch.apache.org Sent: Tue, Feb 8, 2011 9:59 am Subject: Re: nutch crawl command takes 98% of cpu Hi, Thanks for all the feedback. It

skip Urls regex

2011-03-17 Thread alxsss
Hello, I see in the nutch-1.2/conf/regex-urlfilter.txt file the following lines: # skip URLs with slash-delimited segment that repeats 3+ times, to break loops -.*(/[^/]+)/[^/]+\1/[^/]+\1/ However, nutch fetches urls like http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/ Thanks.
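For what it's worth, the rule can be checked outside Nutch with java.util.regex. The standalone sketch below (URLs taken from the message above) suggests why that URL slips through: the pattern only rejects a segment repeating 3+ times with exactly one other segment between each repetition, and here the gaps between the /dev repeats differ in length.

```java
import java.util.regex.Pattern;

public class SkipRegexCheck {
    public static void main(String[] args) {
        // The exclusion rule from conf/regex-urlfilter.txt (leading '-' means
        // "reject"): same slash-delimited segment three times, with exactly
        // ONE other segment between repeats, i.e. /a/x/a/y/a/.
        Pattern skip = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        // "/dev" occurs three times, but with gaps of different lengths
        // (/faq vs /content/2305), so the rule does NOT match -> fetched.
        String fetched = "http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/";
        System.out.println(skip.matcher(fetched).find());  // false

        // A true one-segment-gap loop is rejected as intended.
        String looping = "http://www.example.com/a/x/a/y/a/";
        System.out.println(skip.matcher(looping).find());  // true
    }
}
```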

Re: Problem with Gora dependencies in trunk

2011-03-17 Thread alxsss
Hi, If you download gora and build it with ant, you get rid of one of the dependencies --unresolved dependency: org.apache.gora#gora-core;0.1: not found-- if you change the gora version from 1.0 to 1.0-incubator in one of the ivy files, but this one --unresolved dependency:

Re: Problem with Gora dependencies in trunk

2011-03-17 Thread alxsss
Hi, Did you build gora with ant? I checked out from svn a few days ago and ant for gora gives error :: [ivy:resolve] :: UNRESOLVED DEPENDENCIES :: [ivy:resolve] :: [ivy:resolve] ::

Re: Script failing when arriving at 'Solr' commands

2011-04-07 Thread alxsss
It seems to me that you may have the same problem as before with the disk space. This may happen because you do mergesegs. Try not to merge segments. Alex. -Original Message- From: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk To: user user@nutch.apache.org Sent: Wed, Apr

Re: will nutch-2 be able to index image files

2011-04-22 Thread alxsss
Hello, Looks like I will have some spare time in the next month, so I may work on writing this image indexing plugin. I wondered if there is a similar plugin to leverage code from or follow it? Thanks. Alex. -Original Message- From: Andrzej Bialecki a...@getopt.org To:

Re: Hosts File Nutch 1.0+

2011-04-26 Thread alxsss
It seems you should move www.example.com example.com from line 3 to line 1, uncomment line 3 and comment other lines. Alex. -Original Message- From: Alex alex.thegr...@ambix.net To: user user@nutch.apache.org Sent: Tue, Apr 26, 2011 4:18 am Subject: Re: Hosts File Nutch

keeping index up to date

2011-06-01 Thread alxsss
Hello, I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them. Thanks. Alex.

Re: keeping index up to date

2011-06-07 Thread alxsss
Hi, I took a look at the recrawl script and noticed that all the steps except url injection are repeated on each subsequent indexing run, and wondered why we would generate new segments. Is it possible to do fetch and update for all previous $s1..$sn, then the invertlinks and index steps? Thanks. Alex.

ranking of search results

2011-07-22 Thread alxsss
Hello, I use nutch 1.2 and solr to index about 3500 domains. I noticed that search results for two or more keywords are not ranked properly. For example, for the keyword Lady Gaga, some results that have Lady are displayed first, then some results with both keywords, etc. It seems to me that results

Re: keeping index up to date

2011-07-26 Thread alxsss
Hello, One more question. Is there a way of adding new urls to crawldb created in previous crawls to include in subsequent recrawls? Thanks. Alex. -Original Message- From: lewis john mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org; markus.jelsma

Re: solrindex command` not working

2011-07-26 Thread alxsss
check for errors in solr log. -Original Message- From: Way Cool way1.wayc...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jul 26, 2011 3:14 pm Subject: Re: solrindex command` not working The latest solr version is 3.3. Maybe you can try that. On Tue, Jul 26, 2011 at 2:10 AM,

ranking in nutch/solr results

2011-07-30 Thread alxsss
Hello, I use nutch-1.2 with solr 1.4. Recently, I noticed that in a search for a domain name, for example yahoo.com, yahoo.com is not in first place. Instead, other sites that have yahoo.com in their content are in the first places. I tested this issue with google. In its results the domain is in the
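One knob worth trying on the Solr side (independent of Nutch's own scoring) is query-time field boosting with the dismax parser, which solr 1.4 ships with. The field names below are the ones in Nutch's stock schema.xml (host, url, title, anchor, content), but the boost values are made-up illustrations, not a recommendation:

```
defType=dismax
qf=host^5.0 url^4.0 title^2.0 anchor^2.0 content^1.0
```

With weights like these, a document whose host or url field matches yahoo.com should rank above documents that merely mention yahoo.com in their content.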

Re: nutch redirect treatment

2011-08-17 Thread alxsss
https://issues.apache.org/jira/browse/NUTCH-1044 -Original Message- From: abhayd ajdabhol...@hotmail.com To: nutch-user nutch-u...@lucene.apache.org Sent: Wed, Aug 17, 2011 11:44 am Subject: nutch redirect treatment hi I have seen similar posts in this forum but still not

Re: nutch redirect treatment

2011-08-17 Thread alxsss
As far as I understood, redirected urls are scored 0 and that is why the fetcher does not pick them up at the earlier depths. They may be crawled starting from depth 4, depending on the size of the seed list. -Original Message- From: abhayd ajdabhol...@hotmail.com To: nutch-user

Re: fetcher runs without error with no internet connection

2011-08-23 Thread alxsss
Hi Lewis, I stopped the fetcher and started it on the same segment again. But before doing that I turned off the modem, and the fetcher started giving UnknownHost exceptions. It was not giving any error during the dsl failure, i.e. when I was not able to connect to any sites. Again, this is nutch-1.2. Thanks. Alex.

Re: fetcher runs without error with no internet connection

2011-08-30 Thread alxsss
It is a DNS problem, because it was giving a lot of UnknownHost exceptions. I decreased the thread number to 5, but DSL still fails periodically. I wondered what the common internet connection is for fetching about 3500 domains. I currently have DSL at 3 Mbps. Thanks. Alex. -Original

spellchecking in nutch solr

2011-09-01 Thread alxsss
Hello, I have tried to implement a spellchecker based on the index in nutch-solr by adding a spell field to schema.xml and making it a copy of the content field. However, this doubled the data folder size, and the spell field, as a copy of the content field, appears in the xml feed, which is not necessary. Is it
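One way to keep the copy out of the response feed and shrink the stored data is to make the spellcheck field indexed but not stored. This is a sketch against a generic Solr 1.4-style schema; the field and type names here are assumptions, not Nutch's shipped schema:

```xml
<!-- indexed so the spellchecker can read it, but stored="false" so it
     never appears in the XML response and adds no stored data -->
<field name="spell" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="content" dest="spell"/>
```

Note the inverted index for the spell field still grows the data folder somewhat; stored="false" only removes the duplicate stored copy.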

Re: Crawl fails - Input path does not exist

2011-09-13 Thread alxsss
Comparing with nutch-1.2, I do not see any content folder under the segments ones. Does this mean that we cannot set store.content to false in nutch-1.3? Thanks. Alex. -- View this message in context: http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3334709.html Sent

Re: more from link

2011-09-14 Thread alxsss
I see what is done in nutch results. Results are grouped with 1 doc in each group. I need to group with 3 docs max in each group. In Solr, it is impossible to paginate when grouping with more than 1 doc in each group. Google can do it with 5 docs in the first group, as I see. Thanks. Alex.

restart a failed job

2011-09-20 Thread alxsss
Hello, I wondered if it is possible to restart a failed job in nutch-1.3 version. I have this error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/ after fetching for 5 days. I know the reason for the error, but do not want to restart the

fetch command does not parse

2011-09-22 Thread alxsss
Hello, I tried the fetch command with the following config: <property> <name>fetcher.store.content</name> <value>false</value> <description>If true, fetcher will store content.</description> </property> <property> <name>fetcher.parse</name> <value>true</value> <description>If true, fetcher will

Re: Removing urls from crawl db

2011-11-01 Thread alxsss
I think you must add a regex to regex-urlfilter.txt . In that case those urls will not be fetched by fetcher. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Tue, Nov 1, 2011 10:35 am Subject: Re: Removing urls from crawl db Already did

Re: how use NUTCH-16 in my nutch 1.3?

2011-11-03 Thread alxsss
I think this patch is already included in the current version. -Original Message- From: mina tahereganji...@gmail.com To: nutch-user nutch-u...@lucene.apache.org Sent: Wed, Nov 2, 2011 7:08 pm Subject: how use NUTCH-16 in my nutch 1.3? i want to use NUTCH-61 in

Re: Fetching just some urls outside domain

2011-12-01 Thread alxsss
Hello, It is interesting to know how one can put a filter on outlinks. I mean, if I have a regex, in which file should I put it? For example, I want nutch to ignore outlinks ending with .info. Thanks. Alex. -Original Message- From: Arkadi.Kosmynin arkadi.kosmy...@csiro.au To:

Re: Fetching just some urls outside domain

2011-12-01 Thread alxsss
If I understand you correctly, you state that even if my question is related to the current thread, nevertheless I must open a new one? -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Thu, Dec 1, 2011 3:01 pm Subject:

Re: how give several sites to nutch to crawl?

2011-12-03 Thread alxsss
I think you should add this to nutch-site.xml: <property> <name>generate.max.count</name> <value>1000</value> <description>The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode.</description> </property>

Re: Can't crawl a domain; can't figure out why.

2011-12-20 Thread alxsss
It seems that robots.txt in libraries.mit.edu has a lot of restrictions. Alex. -Original Message- From: Chip Calhoun ccalh...@aip.org To: user user@nutch.apache.org; 'markus.jel...@openindex.io' markus.jel...@openindex.io Sent: Tue, Dec 20, 2011 7:28 am Subject: RE: Can't crawl

Re: Solrdedup fails due to date format

2012-02-01 Thread alxsss
Hello, I took a look at the source of the SolrDeleteDuplicates class. The patch is already applied. Any ideas what might be wrong? I issue this command: bin/nutch solrdedup http://127.0.0.1:8983/solr/ and the solr schema is the one that comes with nutch. Thanks in advance. Alex.

Re: http.redirect.max

2012-03-01 Thread alxsss
Hello, I tried 1, 2, and -1 for the config http.redirect.max, but nutch still postpones redirected urls to later depths. What is the correct config setting to have nutch crawl redirected urls immediately? I need it because I have a restriction that the depth be at most 2. Thanks. Alex.

different fetch interval for each depth urls

2012-03-01 Thread alxsss
Hello, I need to have different fetch intervals for the initial seed urls and the urls extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it seems it cannot be used to solve this issue. Thanks in advance. Alex.

Re: different fetch interval for each depth urls

2012-03-02 Thread alxsss
I need to make this a cron job, so I cannot make changes manually. My problem is to index newspaper sites, but only the new links that are added every day, and not fetch ones that have already been fetched. Thanks. Alex. -Original Message- From: Markus Jelsma

using less resources

2012-05-22 Thread alxsss
Hello, As far as I understood, nutch recrawls urls when their fetch time has passed the current time, regardless of whether those urls were modified or not. Is there any initiative on restricting recrawls to only those urls that have a modified time (MT) greater than the old MT? In detail: if nutch has crawled

nutch-2.0 updatedb and parse commands

2012-06-18 Thread alxsss
Hello, It seems to me that all the options to the updatedb command that nutch 1.4 has have been removed in nutch-2.0. I would like to know if this was done purposefully or whether they will be added later? Also, how can I create multiple docs using the parse command? It seems there are not sufficient arguments to

Re: nutch-2.0 updatedb and parse commands

2012-06-19 Thread alxsss
Hi Lewis, In the 1.X version there are the -noAdditions option to the updatedb command and the -adddays option to the generate command. How can something similar be done in the 2.X version? Here, http://wiki.apache.org/nutch/Nutch2Roadmap it is stated: Modify code so that parser can generate multiple documents

Re: using less resources

2012-06-20 Thread alxsss
I was thinking of using the last modified header, but it may be absent. In that case we could use a signature of urls at indexing time. I took a look at the code; it seems it is implemented but not working. I tested nutch-1.4 with a single url; solrindexer always sends the same number of documents to
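The signature idea in this message can be sketched independently of Nutch: store a digest of the fetched content next to the url, and skip re-indexing when a new fetch digests to the same value. This is a toy illustration only (Nutch has its own Signature implementations; this is not that code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SignatureCheck {
    // Hex-encoded MD5 of raw page bytes; acts as the stored "signature".
    static String digest(byte[] content) throws Exception {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance("MD5").digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Re-index only when the newly fetched bytes digest differently from
    // the signature saved at the previous crawl.
    static boolean needsReindex(String storedDigest, byte[] fetched) throws Exception {
        return !digest(fetched).equals(storedDigest);
    }

    public static void main(String[] args) throws Exception {
        byte[] page = "<html>an unchanged page</html>".getBytes(StandardCharsets.UTF_8);
        String stored = digest(page);                    // saved at crawl N
        System.out.println(needsReindex(stored, page));  // false: skip at crawl N+1
    }
}
```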

parse and solrindex in nutch-2.0

2012-06-25 Thread alxsss
Hello, I have tested nutch-2.0 with hbase and mysql, trying to index only one url with depth 1. I tried to fetch an html tag value and parse it to the metadata column in the webpage object by adding a parse-tag plugin. I saw there is no metadata member variable in the Parse class, so I used putToMetadata

Re: parse and solrindex in nutch-2.0

2012-07-02 Thread alxsss
Hi, Thank you for the clarifications. Regarding the metadata, what would be a proper way of parsing and indexing multivalued tags in nutch-2.0 then? Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Wed, Jun 27, 2012 1:20

Re: parse and solrindex in nutch-2.0

2012-07-03 Thread alxsss
Hi, I was planning to parse img tags from a url's content and put them in the metadata field of the Webpage storage class in nutch2.0, to retrieve them later in the indexing step. However, since there is no metadata data type variable in the Parse class (compare with outlinks), this cannot be done in nutch

Re: updatedb in nutch-2.0 with mysql

2012-07-25 Thread alxsss
Not sure if I understood correctly. I did Counters c = currentJob.getCounters(); System.out.println(c.toString()); With Mysql DbUpdaterJob: starting Counters: 20 DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=878298

Re: updatedb in nutch-2.0 with mysql

2012-07-26 Thread alxsss
I queried webpage table and there are a few links in outlinks column. As I noted in the original letter updatedb works with Hbase. This is the counters output in the case of Hbase. bin/nutch updatedb DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters

Re: updatedb in nutch-2.0 with mysql

2012-07-27 Thread alxsss
I tried your suggestion with the sql server and everything works fine. The issue that I had was with mysql though. mysql Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1 After I restarted the mysql server and added the mysql root user to gora.properties, updatedb adds outlinks as new

Re: Nutch 2.0 Solr 4.0 Alpha

2012-07-29 Thread alxsss
Which storage do you use? Try solrindex with option -reindex. -Original Message- From: X3C TECH t...@x3chaos.com To: user user@nutch.apache.org Sent: Sun, Jul 29, 2012 10:58 am Subject: Re: Nutch 2.0 Solr 4.0 Alpha Forgot to do Specs VMWare Machine with CentOS 6.3 On Sun, Jul 29,

Re: Why won't my crawl ignore these urls?

2012-07-30 Thread alxsss
Why don't you test your regex, to see if it really catches the urls you want to eliminate? It seems to me that your regex does not eliminate the type of urls you specified. Alex. -Original Message- From: Ian Piper ianpi...@tellura.co.uk To: user user@nutch.apache.org Sent: Mon, Jul

Re: Different batch id

2012-07-31 Thread alxsss
Hi, Most likely you ran the generate command a few times and did not run updatedb. So each generate command assigned a different batchId to its own set of urls. Alex. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jul 31, 2012 10:26

updatedb fails to put UPDATEDB_MARK in nutch-2.0

2012-07-31 Thread alxsss
Hello, I noticed that the updatedb command must remove the gen, parse and fetch marks and put the UPDATEDB_MARK mark, according to the code Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page); if (mark != null) { Mark.UPDATEDB_MARK.putMark(page, mark); } in DbUpdateReducer.java

Re: Nutch 2 solrindex

2012-08-01 Thread alxsss
This is directly related to the thread I have opened yesterday. I think this is a bug, since updatedb fails to put update mark. I have fixed it by modifying code. I have a patch, but not sure if I can send it as an attachment. Alex. -Original Message- From: Bai Shen

Re: Nutch 2 solrindex

2012-08-02 Thread alxsss
The current code putting the updb_mrk in DbUpdateReducer is as follows: Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page); if (mark != null) { Mark.UPDATEDB_MARK.putMark(page, mark); } The mark is always null, independent of whether there is a PARSE_MARK or not. This function calls public

Re: Different batch id

2012-08-02 Thread alxsss
Hi, I have found out that what happens after bin/nutch generate -topN 1000 is that only 1000 of the urls have been marked with gnmrk. Then bin/nutch fetch -all skips all urls that do not have gnmrk, according to the code Utf8 mark = Mark.GENERATE_MARK.checkMark(page); if

Re: Nutch 2 encoding

2012-08-09 Thread alxsss
Hi, I use hbase-0.92.1 and do not have problem with utf-8 chars. What is exactly your problem? Alex. -Original Message- From: Ake Tangkananond iam...@gmail.com To: user user@nutch.apache.org Sent: Thu, Aug 9, 2012 11:12 am Subject: Re: Nutch 2 encoding Hi, I'm debugging. I

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-11 Thread alxsss
Hello, I am getting the same error and here is the log 2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2271) at

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-11 Thread alxsss
I was able to do jstack just before the program exited. The output is attached. -Original Message- From: alxsss alx...@aim.com To: user user@nutch.apache.org Sent: Sat, Aug 11, 2012 2:17 pm Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Hello, I am

updatedb error in nutch-2.0

2012-08-12 Thread alxsss
Hello, I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54) at

Re: updatedb error in nutch-2.0

2012-08-13 Thread alxsss
I found out that the key sent to unreverseUrl in DbUpdateMapper.map was :index.php/http This happened at depth 3, and I checked the seed file; there was no line of the form http:/index.php Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user
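For context, nutch-2.0 keys webpage rows by a reversed url (see the 'com.mysite:http/' key quoted elsewhere in this archive). The rough sketch below is not Nutch's code — org.apache.nutch.util.TableUtil is the authoritative implementation, and this toy ignores ports and other corner cases — but it shows the key shape and why a key like :index.php/http, with nothing before the colon, has no host part for unreversing to split on.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ReverseUrlSketch {
    // Hedged sketch of the row-key scheme: reversed host, then ":" +
    // protocol + path, e.g. http://mysite.com/ -> com.mysite:http/
    static String reverseUrl(String url) throws MalformedURLException {
        URL u = new URL(url);
        String[] parts = u.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {  // reverse host labels
            key.append(parts[i]);
            if (i > 0) key.append('.');
        }
        return key.append(':').append(u.getProtocol()).append(u.getFile()).toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reverseUrl("http://mysite.com/"));         // com.mysite:http/
        System.out.println(reverseUrl("http://www.example.com/a/b")); // com.example.www:http/a/b
    }
}
```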

Re: nutch 2.0 with hbase 0.94.0

2012-08-13 Thread alxsss
did you delete the old hbase jar from the lib dir? Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Mon, Aug 13, 2012 10:16 am Subject: Re: nutch 2.0 with hbase 0.94.0 Nutch contains no knowledge of which specific

updatedb goes over all urls in nutch-2.0

2012-08-17 Thread alxsss
Hi, I noticed that the updatedb command goes over all urls, even if they have been updated in the previous generate, fetch, updatedb stages. As a result updatedb takes a long time, depending on the number of rows in the datastore. I thought maybe this is redundant and it must be restricted to not

fetcher fails on connection error in nutch-2.0 with hbase

2012-08-19 Thread alxsss
After fetching for about 18 hours fetcher throws this error java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at

speed of fetcher in nutch-2.0

2012-08-23 Thread alxsss
Hello, I am using nutch-2.0 with hbase-0.92.1. I noticed that at depths 1, 2, 3 the fetcher was fetching around 20K urls per hour. At depth 4 it fetches only 8K urls per hour. Any ideas what could cause this decrease in speed? I use local mode with 10 threads. Thanks. Alex.

Re: recrawl a URL?

2012-08-24 Thread alxsss
This will work only for urls that have an If-Modified-Since header. But most urls do not have this header. Thanks. Alex. -Original Message- From: Max Dzyuba max.dzy...@comintelli.com To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Sent: Fri, Aug 24, 2012

Re: Nutch 2 solrindex fails with no error

2012-09-17 Thread alxsss
You can use the -reindex option, since the updt markers are not set properly in the 2.0 release. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Mon, Sep 17, 2012 10:16 am Subject: Re: Nutch 2 solrindex fails with no error The problem appears

updatedb in nutch-2.0 increases fetch time of all pages

2012-09-17 Thread alxsss
Hello, updatedb in nutch-2.0 increases the fetch time of all pages, independent of whether they have already been fetched or not. For example, if updatedb is applied at depth 1 and page A is fetched and its fetchTime is 30 days from now, then as a result of running updatedb at depth 2 the fetch time of page

Re: Building Nutch 2.0

2012-10-01 Thread alxsss
It seems to me that if you run nutch in deploy mode and make changes to config files, then you need to rebuild the .job file again unless you specify the config dir option in the hadoop command. Alex. -Original Message- From: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org

nutch-2.0 generate in deploy mode

2012-10-01 Thread alxsss
Hello, I use nutch-2.0 with hadoop-0.20.2. bin/nutch generate command takes 87% of cpu in deploy mode versus 18% in local mode. Any ideas how to fix this issue? Thanks. Alex.

Re: Building Nutch 2.0

2012-10-02 Thread alxsss
According to the code in bin/nutch, if you have a .job file in your NUTCH_HOME then it means that you run it in deploy mode. If there is no .job file then you run it in local mode, so you do not need to rebuild nutch each time you change conf files. Alex. -Original Message- From:

Re: Error parsing html

2012-10-02 Thread alxsss
Can you provide a few lines of log or the url that gives the exception? -Original Message- From: CarinaBambina carina.rei...@yahoo.de To: user user@nutch.apache.org Sent: Tue, Oct 2, 2012 2:04 pm Subject: Re: Error parsing html Thanks for the reply. I'm now using Nutch 1.5.1, but

Re: Error parsing html

2012-10-09 Thread alxsss
I checked the urls you provided with parsechecker and they are parsed correctly. You can check yourself by doing bin/nutch parsechecker yoururl. In your implementation, can you check if the segment dir has the correct permissions? Alex. -Original Message- From: CarinaBambina

nutch-2.0-fetcher fails in reduce stage

2012-10-15 Thread alxsss
Hello, I try to use nutch-2.0, hadoop-1.03, hbase-0.92.1 in pseudo distributed mode with iptables turned off. As soon as map reaches 100%, fetcher works for a few minutes and fails with the error java.net.ConnectException: Connection refused at

Re: nutch-2.0-fetcher fails in reduce stage

2012-10-17 Thread alxsss
Hello, Today, I closely followed all hbase and hadoop logs. As soon as map reached 100% reduce was 33%. Then when reduce reached 66% I saw in hadoop's datanode log the following error 2012-10-16 22:44:54,634 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:

Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread alxsss
Hello, I think the problem is with the storage not nutch itself. Looks like generate cannot read status or fetch time (or gets null values) from mysql. I had a bunch of issues with mysql storage and switched to hbase at the end. Alex. -Original Message- From: Sebastian Nagel

Re: Same pages crawled more than once and slow crawling

2012-10-19 Thread alxsss
Hello, I meant that it could be a gora-mysql problem. In order to test it, you can run nutch in local mode with Generator Debug enabled. Put this log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout in your conf/log4j.properties and run the crawl cycle with updatedb. if gora-mysql

Re: Image search engine based on nutch/solr

2012-10-21 Thread alxsss
Hello, I have also written this kind of plugin. But instead of putting the thumbnail files in the solr index, they are put in a folder. Only the filenames are kept in the solr index. I wondered what the advantage is of putting thumbnail files in the solr index? Thanks in advance. Alex.

Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails

2012-11-01 Thread alxsss
Thank you alxsss for the suggestion. It displays the actualSize and inHeaderSize for every file and two more lines in the logs, but it did not give much information even when I set parserJob to Debug. I had the same problem when I re-compiled everything today. I have to run the parse command

Re: Access crawled content or parsed data of previous crawled url

2012-11-28 Thread alxsss
It is not clear what you are trying to achieve. We have done something similar regarding indexing img tags. We retrieve the img tag data while parsing the html page and keep it in metadata, and when parsing the img url itself we create a thumbnail. hth. Alex. -Original Message- From:

Re: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread alxsss
Hi, Unfortunately, my employer does not want me to disclose details of the plugin at this time. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-04 Thread alxsss
move or copy that jar file to local/lib and try again. hth. Alex. -Original Message- From: Arcondo arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 2:55 am Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hope that

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-04 Thread alxsss
Which version of nutch is this? Did you follow the tutorial? I can help you if you provide all the steps you did, starting with downloading nutch. Alex. -Original Message- From: Arcondo Dasilva arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 1:23 pm

Re: Native Hadoop library not loaded and Cannot parse sites contents

2013-01-07 Thread alxsss
Hi, You can unjar the jar file and check if the class that parse complains about is inside it. You can also try to put the content of the jar file under local/lib. Maybe there is some read restriction. If this does not help, I can only suggest starting again with a fresh copy of nutch. Alex.

nutch/util/NodeWalker class is not thread safe

2013-01-16 Thread alxsss
Hello, I use the class NodeWalker at src/java/org/apache/nutch/util/NodeWalker.java in one of our plugins. I noticed this comment above the class: // Currently this class is not thread safe. It is assumed that only one thread will be accessing the <code>NodeWalker</code> at any given time.
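A minimal way to honor that comment without adding locking is to give each thread its own instance. The Walker class below is a stand-in (the real NodeWalker takes a DOM Node in its constructor), so this only sketches the per-thread pattern, not Nutch code:

```java
public class PerThreadWalker {
    // Stand-in for a non-thread-safe class such as
    // org.apache.nutch.util.NodeWalker (only named here, not used).
    static class Walker { }

    // Each thread lazily gets its own Walker: no sharing, no synchronization.
    static final ThreadLocal<Walker> WALKER = ThreadLocal.withInitial(Walker::new);

    static boolean distinctPerThread() throws InterruptedException {
        final Walker[] seen = new Walker[2];
        Thread t1 = new Thread(() -> seen[0] = WALKER.get());
        Thread t2 = new Thread(() -> seen[1] = WALKER.get());
        t1.start(); t2.start(); t1.join(); t2.join();
        // Different threads always observe different instances.
        return seen[0] != null && seen[1] != null && seen[0] != seen[1];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(distinctPerThread());  // true
    }
}
```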

Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread alxsss
I see that inlinks are saved as ol in hbase. Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jan 30, 2013 9:31 am Subject: Re: Nutch 2.0 updatedb and gora query Link to the reference (

Re: Nutch 2.0 updatedb and gora query

2013-01-30 Thread alxsss
What do you call inlinks? I call inlinks for mysite.com all urls such as mysite.com/myhtml1.html, mysite.com/myhtml2.html, etc. Currently they are saved as ol in hbase. From the hbase shell do this: get 'webpage', 'com.mysite:http/' and check what the ol family looks like. I have these config property

Re: Nutch 1.6 +solr 4.1.0

2013-02-06 Thread alxsss
Hi, Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with solr-4.1.0. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 6, 2013 6:13 pm Subject: Re: Nutch 1.6 +solr 4.1.0 Hi, We are

Re: Nutch 2.1 + HBase cluster settings

2013-02-06 Thread alxsss
Hi, So, you do not run hadoop and nutch job works in distributed mode? Thanks. Alex. -Original Message- From: k4200 k4...@kazu.tv To: user user@nutch.apache.org Sent: Wed, Feb 6, 2013 7:43 pm Subject: Re: Nutch 2.1 + HBase cluster settings Hi Lewis, There seems to be a bug

Re: Nutch identifier while indexing.

2013-02-13 Thread alxsss
Are you saying that your sites have the form siteA.mydomain.com, siteB.mydomain.com, siteC.mydomain.com? Alex. -Original Message- From: mbehlok m_beh...@hotmail.com To: user user@nutch.apache.org Sent: Wed, Feb 13, 2013 11:05 am Subject: Nutch identifier while indexing. Hello, I

nutch cannot retrive title and inlinks of a domain

2013-02-13 Thread alxsss
Hello, I noticed that nutch cannot retrieve the title and inlinks of one of the domains in the seed list. However, if I run identical code from the server where this domain is hosted, then it correctly parses it. The surprising thing is that in both cases this url has status: 2 (status_fetched)

Re: nutch cannot retrive title and inlinks of a domain

2013-02-13 Thread alxsss
Hi, I noticed that for the other urls in the seed, inlinks are saved as ol. I checked the code and figured out that this is done by the part that saves anchors. So, in my case inlinks are saved as anchors in the field ol in hbase. But for one of the urls, the title and inlinks are not retrieved,

fields in solrindex-mapping.xml

2013-02-14 Thread alxsss
Hello, I see that there are <field dest="segment" source="segment"/> <field dest="boost" source="boost"/> <field dest="digest" source="digest"/> <field dest="tstamp" source="tstamp"/> fields in addition to the title, host and content ones in nutch-2.x'
