[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument

2013-08-19 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743621#comment-13743621
 ] 

lufeng commented on NUTCH-1619:
---

Hi Yasin, did you forget to close the data store? Good.

 Writes Dmoz Description and Title information to db with snippet argument
 -

 Key: NUTCH-1619
 URL: https://issues.apache.org/jira/browse/NUTCH-1619
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1
Reporter: Yasin Kılınç
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-DMOZ-Snippet.patch


 We need the Dmoz information of fetched URLs to be written to the database, so 
 that this information can be used as a snippet by the indexer of the search 
 engine we are working on.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: crawl.gen.delay

2013-08-19 Thread feng lu
Yes, it is used in Nutch 1.x but never in Nutch 2.x, because Nutch 2.x
never generates selected URLs.

The unit of crawl.gen.delay is milliseconds; you can check this in the
Nutch 1.x nutch-default.xml, where the property is described like this:

<property>
  <name>crawl.gen.delay</name>
  <value>60480</value>
  <description>
   This value, expressed in milliseconds, defines how long we should keep
   the lock on records in CrawlDb that were just selected for fetching.
   If these records are not updated in the meantime, the lock is canceled,
   i.e. they become eligible for selecting.
   Default value of this is 7 days (60480 ms).
  </description>
</property>

Maybe it is wrong.
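
As a quick check of the arithmetic in that description: seven days expressed in milliseconds is 604800000, which does not match the 60480 shown in the quoted text (the digits may have been lost somewhere along the way). A minimal Python sketch, where `days_to_millis` is a hypothetical helper and not a Nutch API:

```python
from datetime import timedelta


def days_to_millis(days: int) -> int:
    """Convert whole days to milliseconds (hypothetical helper, not part of Nutch)."""
    # Dividing one timedelta by another with // yields an exact integer.
    return timedelta(days=days) // timedelta(milliseconds=1)


if __name__ == "__main__":
    # Seven days, the lock duration the description talks about.
    print(days_to_millis(7))
```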

On Fri, Aug 16, 2013 at 3:17 AM, kaveh minooie ka...@plutoz.com wrote:

 crawl.gen.delay





-- 
Don't Grow Old, Grow Up... :-)


[jira] [Commented] (NUTCH-1623) Implement file.content.ignored function

2013-08-19 Thread Osy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743856#comment-13743856
 ] 

Osy commented on NUTCH-1623:


Sure Lewis. For Nutch 2.2.1, nutch-default.xml contains a description of 
this functionality (!! NO IMPLEMENTED YET !!):

If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.

Exactly what I need.

Thanks
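
The behaviour described in that property text could look roughly like the following. This is a minimal Python sketch, not Nutch code: the function `content_to_store` and its parameters are hypothetical, and only the property name `file.content.ignored` comes from the description above.

```python
from typing import Optional
from urllib.parse import urlsplit


def content_to_store(url: str, content: bytes,
                     file_content_ignored: bool) -> Optional[bytes]:
    """Return the bytes to persist for a fetched URL, or None to store nothing.

    `file_content_ignored` stands in for the file.content.ignored property;
    how Nutch would actually read and apply it is an assumption here.
    """
    # Only file:// URLs are affected; their content can be re-read locally
    # at parsing and indexing time, so it need not be saved during fetch.
    if file_content_ignored and urlsplit(url).scheme == "file":
        return None
    return content
```

With the flag set, a `file:///...` URL yields `None` while an `http://` URL keeps its content; with the flag unset, content is always kept.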

 Implement file.content.ignored function
 ---

 Key: NUTCH-1623
 URL: https://issues.apache.org/jira/browse/NUTCH-1623
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, fetcher
Affects Versions: 2.2, 2.2.1
Reporter: Osy





nofollow behaviour [#NUTCH-693]

2013-08-19 Thread Santiago M. Mola
Hi,

I've experimented with Nutch for crawling Tor hidden services and I still
find an annoying issue that requires a patched Nutch version:
#NUTCH-693 [1].

This issue is a request for an option to control the behaviour of Nutch
when getting a rel=nofollow link. Currently, Nutch always ignores such
links and there is no way of configuring this behaviour without patching it.

The issue was closed with little discussion, on the claim that such an option
would be the same as a hypothetical ignore.robotstxt option. This is not the
case. robots.txt is the way for webmasters to prevent crawlers from accessing
certain URLs. This is *not* the job of nofollow. robots.txt is always
controlled by the webmaster and, as such, it makes sense to strictly honour
it. On the other hand, nofollow is always controlled by third parties
(otherwise, robots.txt should be used) and its well-established use is
indicating non-endorsement of a URL. In practice, that means not
giving link-juice to potential spammers.

nofollow is not meant to be an access control mechanism. nofollow is not
meant to protect websites from crawler abuse either. That is robots.txt's
job. So there is no point in treating them as the same.

Now, there are very real use cases for following links with the
rel=nofollow attribute. In a loosely connected portion of the web,
following these links might be the only sane way to crawl successfully.

The Tor deepweb is a very clear case. There is a site which is very central
in the Tor link-graph: The Hidden Wiki. It is a great seed for crawling
Tor. But it's MediaWiki-based. And that means that every external link is
tagged as rel=nofollow. Finding enough good seed URLs to crawl Tor
without going through rel=nofollow links is not trivial at all.

The same might happen when crawling corporate intranets, I2P or other
networks.

So there is a clear use case for adding an option for following
rel=nofollow links. And, as far as I know, there is no point in not
adding it. That is why I would like this to be discussed and, if deemed
sensible, #NUTCH-693 reopened.
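
To make the requested option concrete, here is a minimal Python sketch of outlink extraction with a switch for rel=nofollow links. It is not Nutch code: the class `OutlinkCollector`, the function `extract_outlinks` and the `follow_nofollow` flag are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser
from typing import List


class OutlinkCollector(HTMLParser):
    """Collect href targets of <a> tags, optionally skipping rel=nofollow links."""

    def __init__(self, follow_nofollow: bool = False):
        super().__init__()
        # Hypothetical option; it does not correspond to an existing Nutch property.
        self.follow_nofollow = follow_nofollow
        self.outlinks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may carry several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and not self.follow_nofollow:
            return  # current behaviour described above: the link is ignored
        self.outlinks.append(href)


def extract_outlinks(html: str, follow_nofollow: bool = False) -> List[str]:
    """Return outlinks found in `html`, honouring the follow_nofollow switch."""
    parser = OutlinkCollector(follow_nofollow)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

With the default `follow_nofollow=False` the sketch drops nofollow links, matching the behaviour described above; passing `True` keeps them. robots.txt handling is deliberately left out, since it is a separate mechanism.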

[1] https://issues.apache.org/jira/browse/NUTCH-693

Best,
-- 
Santiago M. Mola
Jabber ID: cooldw...@gmail.com


[jira] [Created] (NUTCH-1626) Homebrew formula for installing Nutch in Mac OS X

2013-08-19 Thread Andrew Pennebaker (JIRA)
Andrew Pennebaker created NUTCH-1626:


 Summary: Homebrew formula for installing Nutch in Mac OS X
 Key: NUTCH-1626
 URL: https://issues.apache.org/jira/browse/NUTCH-1626
 Project: Nutch
  Issue Type: Improvement
  Components: build
 Environment: Homebrew (http://brew.sh/)
Mac OS X 10.5+
Reporter: Andrew Pennebaker
Priority: Minor


Manually installing nutch takes time and effort out of a developer's day. It 
would be a great convenience to have an install formula for Homebrew for Mac 
users!

I have begun working on such a formula:

https://github.com/mxcl/homebrew/pull/22004

After `brew install nutch`, you can run `nutch`, but the associated tools like 
`nutch junit` aren't working for some reason.



[jira] [Created] (NUTCH-1627) Debian package for installing nutch

2013-08-19 Thread Andrew Pennebaker (JIRA)
Andrew Pennebaker created NUTCH-1627:


 Summary: Debian package for installing nutch
 Key: NUTCH-1627
 URL: https://issues.apache.org/jira/browse/NUTCH-1627
 Project: Nutch
  Issue Type: Improvement
  Components: build
 Environment: Ubuntu 12.04 Precise
Reporter: Andrew Pennebaker
Priority: Minor


The simpler it is to install nutch, the easier it is to start using it. Could 
we please create a build task for generating a .deb installer for Debian/Ubuntu?

Eventually, it would be great to have a PPA, and then an official package in 
the Ubuntu apt repo.



[jira] [Created] (NUTCH-1628) Chocolatey package for Windows users

2013-08-19 Thread Andrew Pennebaker (JIRA)
Andrew Pennebaker created NUTCH-1628:


 Summary: Chocolatey package for Windows users
 Key: NUTCH-1628
 URL: https://issues.apache.org/jira/browse/NUTCH-1628
 Project: Nutch
  Issue Type: Improvement
  Components: build
 Environment: Chocolatey (http://chocolatey.org/)
Windows XP+
Reporter: Andrew Pennebaker
Priority: Minor


Setting up developer tools in Windows can be a trial. If we provided a 
Chocolatey package for nutch, it could bring more Windows users into the fold, 
encouraging them to use nutch as a dependency in larger software systems.
