Generator is building fetch list using *lowest* scoring URLs
Key: NUTCH-348
URL: http://issues.apache.org/jira/browse/NUTCH-348
Project: Nutch
Issue Type: Bug
Hi,
There was recently some discussion about perhaps starting a new Lucene
sub-project, named Tika, to create a general-purpose library from the
parser components and other features in Nutch that might interest a
wider audience. To keep things rolling we've created a temporary
staging area for the
Port Nutch to use Hadoop Text instead of UTF8
-
Key: NUTCH-349
URL: http://issues.apache.org/jira/browse/NUTCH-349
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Hi,
I have some questions about the dependencies of the Parser interface,
especially from the perspective of generalizing it to the potential
Tika project. The current dependencies are:
* Configurable - depends on the Hadoop configuration system
* Pluggable - depends on the Nutch plugin
Hi there,
I have noticed that Nutch does not care which language is selected.
When I do a search, Nutch ignores which request.parameter was sent;
instead it always uses the browser's config.
Does anyone know what has to be changed?
I tried to change the search.jsp but it will not be applied in
[
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428399 ]
Sami Siren commented on NUTCH-349:
--
If anything at all should be done then I'd go for #2. There was also a total
incompatibility from 0.7 to 0.8 and I didn't see
Hi Jukka,
Thanks for your email. Indeed, there was discussion on the Lucene PMC email
list, about the Tika project. It was decided by the powers that be to
discuss it more on the Nutch mailing list before moving forward with any
vote on making Tika a sub-project of Apache Lucene. With regards to
Hi Steven,
On 8/16/06 7:36 AM, steven shingler [EMAIL PROTECTED] wrote:
(This thread moved from the User List.)
OK Lukas, let's open it up to the dev list! :)
Particularly, does the group feel moving to Maven would be _a good thing_ ?
+1
I suggested this (however did not make any
I am an outsider, I hope my comments do not cause grief.
AFAIK, Maven is transitioning to version 2. The newer version is more
attractive, but noticeably different from version 1 in implementation
and, closer to home, in configuration.
It might be wise to explore some of the comments from other,
Hi,
I have almost no experience with maven subprojects but somehow I feel
this could help us with Nutch plugins. Am I correct?
In Maven we can always call Ant goals as well, and Jelly is fun to
use. With Maven one of the biggest benefits would be that Eclipse (or
other IDE) classpath settings
Lukas Vlcek wrote:
Hi,
I have almost no experience with maven subprojects but somehow I feel
this could help us with Nutch plugins. Am I correct?
In Maven we can always call Ant goals as well, and Jelly is fun to
use. With Maven one of the biggest benefits would be that Eclipse (or
other IDE)
Chris Mattmann wrote:
However, the current Nutch software contains many value-added pieces of
code that are monolithically packaged together. If the services and
capabilities from the code were provided as separate, modular component
libraries, such services and capabilities could benefit many
Hi,
On 8/16/06, Sami Siren [EMAIL PROTECTED] wrote:
IMO to solve the main problem one does not need to set up another
project, just refactor and repackage.
I'd be happy either way, as long as I get a nice reusable library to
use in Jackrabbit. :-)
I think the key question on whether to
Hi,
I've just written a protocol-smb, it's really simple (code attached).
It uses the jcifs lib and seems to work - but there is some stuff I'd
like to discuss...
Nutch is glued to URL, which works if you write a URLHandler. No
problem so far, but you can't install a URLHandler
On Wednesday 16 August 2006 17:18, Sami Siren wrote:
Lukas Vlcek wrote:
Hi,
I have almost no experience with maven subprojects but somehow I feel
this could help us with Nutch plugins. Am I correct?
In Maven we can always call Ant goals as well, and Jelly is fun to
use. With Maven
Hi,
I have just noticed that there is some activity going on called Tika
(http://code.google.com/p/tika/) and these guys are starting directly
with Maven2 (am I right?).
The more the Nutch/Lucene/Hadoop/[Tika]/[?] thing grows, the more
sophisticated a project management tool will be needed, I think.
Hi
It seems to me that Nutch does not send an HTTP Accept header. Is that on
purpose?
I would have expected that Nutch tells the server which MIME types it
accepts, that is, which ones it is able to parse and index,
but maybe I misunderstand something.
Thanks
Michi
--
Michael Wechner
Wyona - Open
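For illustration of the Accept header question above, here is a minimal Java sketch of how a crawler might turn the list of MIME types it can parse into an Accept header value. The class name AcceptHeaderSketch, the helper buildAcceptHeader, the example MIME types, and the q=0.1 wildcard fallback are all assumptions made for this sketch; none of it is taken from the Nutch code base.

```java
import java.util.Arrays;
import java.util.List;

public class AcceptHeaderSketch {

    // Hypothetical helper (not a Nutch API): joins the given MIME types into
    // an Accept header value and appends a low-priority wildcard so that
    // servers may still return other content types.
    static String buildAcceptHeader(List<String> mimeTypes) {
        StringBuilder value = new StringBuilder();
        for (String type : mimeTypes) {
            if (value.length() > 0) {
                value.append(", ");
            }
            value.append(type);
        }
        if (value.length() > 0) {
            value.append(", ");
        }
        value.append("*/*;q=0.1");
        return value.toString();
    }

    public static void main(String[] args) {
        // The MIME types here are examples only.
        String accept = buildAcceptHeader(Arrays.asList("text/html", "application/pdf"));
        System.out.println("Accept: " + accept);
    }
}
```

With java.net.HttpURLConnection, such a value could be attached to a request via setRequestProperty("Accept", value). Whether and where Nutch's protocol plugins would do this is not stated in the message.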
[
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ]
Stefan Groschupf commented on NUTCH-349:
my vote goes to #2.
Having a tool that needs to be started manually would be better than complicating
the already
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ]
Stefan Groschupf commented on NUTCH-233:
Hi Otis,
Yes, for a serious whole-web crawl I need to change this regex first.
It only hangs with some random urls
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]
Stefan Groschupf updated NUTCH-348:
---
Attachment: sortPatchV1.patch
What do people think about this kind of solution?
Generator is building fetch list using *lowest* scoring URLs
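To make the intended behaviour in the issue title concrete, here is a minimal, self-contained Java sketch of picking the highest-scoring URLs for a fetch list by sorting in descending score order. It is not the attached sortPatchV1.patch and not Nutch's Generator code. The names TopScoreSketch, ScoredUrl and topN are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopScoreSketch {

    // Hypothetical value type: a URL paired with its score.
    static final class ScoredUrl {
        final String url;
        final float score;

        ScoredUrl(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Returns up to n entries with the highest scores, best first.
    // The input list is left unmodified.
    static List<ScoredUrl> topN(List<ScoredUrl> candidates, int n) {
        List<ScoredUrl> sorted = new ArrayList<>(candidates);
        // Descending order: reversing an ascending comparator puts the
        // highest score first. An ascending sort here would select the
        // lowest-scoring URLs instead.
        sorted.sort(Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());
        int limit = Math.max(0, Math.min(n, sorted.size()));
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<ScoredUrl> candidates = new ArrayList<>();
        candidates.add(new ScoredUrl("http://example.com/a", 0.2f));
        candidates.add(new ScoredUrl("http://example.com/b", 1.5f));
        candidates.add(new ScoredUrl("http://example.com/c", 0.7f));
        for (ScoredUrl s : topN(candidates, 2)) {
            System.out.println(s.url + " " + s.score);
        }
    }
}
```

The point of the sketch is only the direction of the comparison: a comparator that orders ascending, or a sort key with the wrong sign, silently yields the lowest-scoring entries.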