[jira] Commented: (NUTCH-811) Develop an ORM framework

2010-05-07 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865226#action_12865226
 ] 

Enis Soztutar commented on NUTCH-811:
-

Hi Piet,
The code for Gora will reside in GitHub for now, since Nutch and Gora are 
pretty orthogonal. But as stated before, Nutch is the first user of Gora, and 
Gora does not yet have a separate community, so I intend to keep the Nutch 
community updated (via this issue and the nutch-dev mailing list), and I hope 
for feedback from the Nutch community.

Moreover, NutchBase has already been ported to Gora, so at some point Gora 
should be reviewed and accepted as a dependency for Nutch.

 Develop an ORM framework 
 -

 Key: NUTCH-811
 URL: https://issues.apache.org/jira/browse/NUTCH-811
 Project: Nutch
  Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 From NUTCH-808, it is clear that we need an ORM layer on top of the datastore, 
 so that different backends can be used to store data. 
 This issue will track the development of the ORM layer. Initially, full 
 support for HBase is planned, with RDBMS, Hadoop MapFile, and Cassandra 
 support scheduled for later. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-26 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar closed NUTCH-808.
---

Resolution: Fixed

We have decided to go ahead with implementing an ORM layer, as per the 
discussion on NUTCH-811. Closing this issue. 

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses the Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks to see whether they 
 suit our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.




[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856360#action_12856360
 ] 

Enis Soztutar commented on NUTCH-808:
-

bq. What do you mean by current implementation? NutchBase?
Indeed. The package o.a.n.storage deals with ORM (though not all of its classes do).

bq. I know that Cascading has various Tap/Sink implementations including 
JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how 
they do it?
The way Cascading does this is by converting Tuples (Cascading's data structure) 
into HBase/JDBC records. The schema for HBase/JDBC is supplied as metadata. 
Since they only deal with a tuple-to-table-row mapping, it is not that 
difficult. But again, Cascading does not allow mapping lists to columns, etc. 

bq. My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use AVRO if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 
Yeah, the more I think about it, the more I come around to a custom 
implementation. However, I think we might benefit a lot from JDO's ideas in the 
long term. Also, a JDBC implementation may not be relevant for large-scale 
deployments, but it will be a very nice side effect of the ORM layer: it will 
allow easy deployment, which in turn will hopefully bring more users. 






[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-12 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856124#action_12856124
 ] 

Enis Soztutar commented on NUTCH-808:
-

So, these are the results so far:

DataNucleus was previously known as JPOX, and it was the reference 
implementation for Java Data Objects (JDO). JDO is a Java standard for 
persistence. A similar specification, JPA, is also a persistence standard, 
forked from EJB 3. However, JPA is designed for RDBMSs only, so it will not be 
useful for us 
(http://www.datanucleus.org/products/accessplatform/persistence_api.html). 

In JDO, the first step is to define the domain objects as POJOs. Then the 
persistence metadata is specified using annotations, XML, or both. A byte-code 
enhancer then uses instrumentation to add the required methods to the classes 
annotated as @PersistenceCapable. The database tables can be created by hand, 
generated automatically by DataNucleus, or generated by a tool (SchemaTool). 
The persistence layer uses standard JDO syntax, which is similar to JDBC. The 
objects can be queried using JPQL. 

I have run a small test to persist objects of the WebTableRow class (from the 
NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of 
time to set up, I was able to persist objects to both. 

However, although it is possible to map complex fields (lists, maps, arrays, 
etc.) to RDBMSs using different strategies (such as serializing directly, using 
joins, or using foreign keys), I was not able to find a way to leverage the 
HBase data model. For example, we want to be able to map lists and maps to 
columns in column families. Without such functionality, using column-oriented 
stores does not bring any advantage. 
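To make the desired mapping concrete, here is a minimal sketch (hypothetical names, not NutchBase or DataNucleus code) of flattening a map-valued field into per-qualifier cells of an HBase column family, which is the behavior we would want from the ORM layer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a map-valued field becomes one cell per qualifier
// inside a column family, instead of a single serialized blob or a join table.
public class ColumnFamilyMapping {

    // Flatten a map field into "family:qualifier" -> value cells.
    static Map<String, String> toCells(String family, Map<String, String> field) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : field.entrySet()) {
            cells.put(family + ":" + e.getKey(), e.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");
        metadata.put("Last-Modified", "2010-04-12");
        System.out.println(toCells("mtdt", metadata));
    }
}
```

Lists could be handled the same way, using the list index as the qualifier.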

For the byte[] serialization for MapReduce, we can either implement a new 
datastore for DataNucleus which also implements Hadoop's Serialization, or use 
Avro to generate Java classes to be fed into the JPOX enhancer, or else 
manually implement Writable. 
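The "manually implement Writable" option amounts to giving the record class a write/readFields pair over DataOutput/DataInput, which is the contract Hadoop's Writable interface defines. A minimal, self-contained sketch (class and field names are hypothetical, and Hadoop itself is not on the classpath here, so the plain java.io interfaces stand in):

```java
import java.io.*;

// Sketch of a record class with the write(DataOutput)/readFields(DataInput)
// pair that Hadoop's Writable contract requires. Names are hypothetical.
public class RowRecord {
    String url = "";
    long fetchTime;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(fetchTime);
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchTime = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        RowRecord r = new RowRecord();
        r.url = "http://example.com/";
        r.fetchTime = 1234L;

        // Round-trip through a byte[]: this is what lets MapReduce pass
        // records between tasks as raw bytes.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        r.write(new DataOutputStream(bos));
        RowRecord copy = new RowRecord();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(copy.url + " " + copy.fetchTime); // prints "http://example.com/ 1234"
    }
}
```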

To sum up, DataNucleus brings the following advantages:
- out-of-the-box RDBMS support 
- XML or annotation metadata
- JDO is a Java standard 
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex; implementing a datastore is not trivial
- we would need to write patches to DataNucleus to flexibly map complex fields 
and leverage HBase's data model
- we have no control over the source code
- no native HBase support (for example, using filters)

On the other hand, the current implementation:
- is tested in production, 
- can leverage the HBase data model, 
- can be modified to work with Avro serialization directly, 
- could support Cassandra with little effort,
- can support multiple languages (in the future).

I believe that having SQLite, MySQL, and HBase support is critical for Nutch 
2.0: for out-of-the-box use, ease of deployment, and real-scale computing, 
respectively. But obviously we cannot use DataNucleus out of the box either. 

ORM is inherently a hard problem. I propose we go ahead and make the changes to 
DataNucleus to see if it is feasible, and continue with it if it suits our 
needs. Of course, having a custom framework would also be great, so any 
feedback would be more than welcome. 






[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)
Evaluate ORM Frameworks which support non-relational column-oriented datastores 
and RDBMs 
--

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar


We have an ORM layer in the NutchBase branch, which uses the Avro Specific 
Compiler to compile class definitions given in JSON. Before moving on with 
this, we might benefit from evaluating other frameworks to see whether they 
suit our needs. 

We want at least the following capabilities:
- Using POJOs 
- Able to persist objects to at least HBase, Cassandra, and RDBMs 
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries 

Any comments, suggestions for other frameworks are welcome.
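As a concrete illustration, the class definitions the Avro Specific Compiler consumes are JSON record schemas roughly like the following (the record name and fields here are illustrative, not taken from the NutchBase branch):

```json
{
  "type": "record",
  "name": "WebPage",
  "namespace": "org.apache.nutch.storage",
  "fields": [
    {"name": "baseUrl", "type": "string"},
    {"name": "fetchTime", "type": "long"},
    {"name": "metadata", "type": {"type": "map", "values": "bytes"}}
  ]
}
```

The compiler turns such a schema into a Java class with typed accessors, which is what makes the generated objects easy to pass through Hadoop jobs.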




[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840
 ] 

Enis Soztutar commented on NUTCH-808:
-

A candidate framework is DataNucleus. It has the following benefits:

- Apache 2 license 
- JDO support 
- HBase, RDBMS, and XML persistence 

I will further investigate whether we can integrate Hadoop Writables/Avro 
serialization so that objects can be passed through MapReduce. 






[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2008-10-07 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637489#action_12637489
 ] 

Enis Soztutar commented on NUTCH-442:
-

I personally believe this patch should go in before 1.0, since it does not make 
sense to make such a change in 1.1. However, since the patch needs some more 
thorough testing, I guess we can create a branch and commit it there, so that 
people can test it easily. Branching has its own problems, though; in 
particular, keeping in sync with trunk would get harder and harder. 

Since this issue has a large number of votes and watchers, I suggest we branch 
and commit it, test it out a little more, and merge to trunk before 1.0. 

 Integrate Solr/Nutch
 

 Key: NUTCH-442
 URL: https://issues.apache.org/jira/browse/NUTCH-442
 Project: Nutch
  Issue Type: New Feature
 Environment: Ubuntu linux
Reporter: rubdabadub
 Attachments: Crawl.patch, Indexer.patch, NUTCH-442_v4.patch, 
 NUTCH-442_v5.patch, NUTCH-442_v6.patch.txt, NUTCH-442_v7.patch.txt, 
 NUTCH-442_v7a.patch.txt, NUTCH_442_v3.patch, 
 RFC_multiple_search_backends.patch, schema.xml


 Hi:
 After trying out Sami's patch regarding Solr/Nutch, which can be found here 
 (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html),
  I can confirm it worked :-) And that led me to request the following:
 I would be very, very grateful if this could be included in Nutch 0.9, as I 
 am trying to eliminate my Python-based crawler, which posts documents to Solr. 
 As I am in a corporate environment I can't install a trunk version in 
 production, thus I am asking for this to be included in the 0.9 release. I 
 hope my wish will be granted.
 I look forward to getting some feedback.
 Thank you.




[jira] Resolved: (NUTCH-588) Help Need

2007-12-07 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar resolved NUTCH-588.
-

Resolution: Invalid

JIRA is not for asking questions. You should ask your questions on the 
nutch-user mailing list; see http://lucene.apache.org/nutch/mailing_lists.html
Closing this issue as invalid. 

 Help Need
 -

 Key: NUTCH-588
 URL: https://issues.apache.org/jira/browse/NUTCH-588
 Project: Nutch
  Issue Type: Task
  Components: indexer
Affects Versions: 0.7.2
 Environment: Linux
Reporter: Teccon Ingenieros

 Hello,
 We are trying to index a Word file. If we put a static URL like 
 (/servlet/jsp/documento.doc) it works OK, but if we try to do the same with 
 a dynamic URL that generates that file (/servlet/jsp/leerFichero.jsp?id=112) 
 it doesn't work; it doesn't index our URL.
 What can we do?
 Regards,




[jira] Commented: (NUTCH-586) Add option to run compiled classes w/o job file

2007-12-04 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548198
 ] 

Enis Soztutar commented on NUTCH-586:
-

Can someone review this?

 Add option to run compiled classes w/o job file
 ---

 Key: NUTCH-586
 URL: https://issues.apache.org/jira/browse/NUTCH-586
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.0.0

 Attachments: run-core_v1.patch


 bin/nutch adds nutch-*.job files under the build and base directories to the 
 classpath. However, building the job file takes a long time. We have a 
 compile-core target which builds only the core classes, without plugins, but 
 we need a way to run the compiled core class files. An option for bin/nutch to 
 run the classes compiled with ant compile-core seems enough. 




[jira] Updated: (NUTCH-586) Add option to run compiled classes w/o job file

2007-12-04 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-586:


Attachment: run-core_v2.patch

bq. I think you also need to put a comment, which clarifies that this works 
only in the local Hadoop mode.
Agreed. This patch addresses that. 





[jira] Updated: (NUTCH-586) Add option to run compiled classes w/o job file

2007-11-30 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-586:


Attachment: run-core_v1.patch

The attached file adds a -core option to bin/nutch. 





[jira] Created: (NUTCH-586) Add option to run compiled classes w/o job file

2007-11-30 Thread Enis Soztutar (JIRA)
Add option to run compiled classes w/o job file
---

 Key: NUTCH-586
 URL: https://issues.apache.org/jira/browse/NUTCH-586
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.0.0
 Attachments: run-core_v1.patch

bin/nutch adds nutch-*.job files under the build and base directories to the 
classpath. However, building the job file takes a long time. We have a 
compile-core target which builds only the core classes, without plugins, but we 
need a way to run the compiled core class files. An option for bin/nutch to run 
the classes compiled with ant compile-core seems enough. 




[jira] Created: (NUTCH-583) FeedParser empty links for items

2007-11-27 Thread Enis Soztutar (JIRA)
FeedParser empty links for items


 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.0.0


The FeedParser in the feed plugin simply discards an item if it does not have a 
link element. However, RSS 2.0 does not require a link element for each 
item. 
Moreover, sometimes the link is given in the guid element, which is a globally 
unique identifier for the item. I think we can search for the item's URL 
first; if it is still not found, we can use the feed's URL, merging all the 
parse texts into one Parse object. 
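The proposed fallback order could look roughly like this (a sketch; the class and method names are hypothetical, not the actual FeedParser code):

```java
// Sketch of the proposed link resolution for a feed item:
// prefer <link>, then a <guid> that parses as an absolute URL,
// and finally fall back to the feed's own URL.
public class ItemLinkResolver {

    static String resolve(String link, String guid, String feedUrl) {
        if (link != null && !link.isEmpty()) {
            return link;
        }
        if (guid != null && isAbsoluteUrl(guid)) {
            return guid; // RSS 2.0 guids are often permalinks
        }
        return feedUrl; // last resort: merge the item into the feed's own Parse
    }

    static boolean isAbsoluteUrl(String s) {
        try {
            return new java.net.URI(s).isAbsolute();
        } catch (java.net.URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "http://example.com/item/1", "http://example.com/feed"));
    }
}
```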




[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

2007-11-16 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543067
 ] 

Enis Soztutar commented on NUTCH-573:
-

So, how shall we proceed with this one?
I give +1 to committing this and dealing with NUTCH-479 in its own issue. 
Having both multi-term queries and OR syntax won't be too bad, I guess. 


 Multiple Domains - Query Search
 ---

 Key: NUTCH-573
 URL: https://issues.apache.org/jira/browse/NUTCH-573
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
 Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
 Fix For: 1.0.0

 Attachments: multiTermQuery_v1.patch


 Searching multiple domains can be done on Lucene - but not that efficiently 
 on Nutch.
 Query:
 +content:abc +(site:www.aaa.com site:www.bbb.com)
 works on Lucene, but the same concept does not work on Nutch.
 In Lucene, it works with 
 org.apache.lucene.analysis.KeywordAnalyzer
 org.apache.lucene.analysis.standard.StandardAnalyzer 
 but NOT with
 org.apache.lucene.analysis.SimpleAnalyzer 
 Is the Nutch analyzer based on SimpleAnalyzer? In this case, is there a 
 workaround to make this work? Is there an option to change what analyzer 
 Nutch is using? 
 Just FYI, another solution (inefficient, I believe) which seems to be working 
 on Nutch:
 query -site:ccc.com -site:ddd.com 




[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

2007-11-14 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542389
 ] 

Enis Soztutar commented on NUTCH-573:
-

bq. Using commas is IMHO not intuitive

With all due respect, I disagree. We cannot expect search users to type 
queries of the form +(site:www.somesite.com site:www.foo.com). Last time I 
checked, Google used the comma syntax. I think supporting only a subset of the 
query syntax that Lucene supports was the original motivation for implementing 
a separate query parser for Nutch, so that ordinary search users would not get 
confused and could use the de facto syntax. 

bq. Also, I'm not sure if the original reporter asked for a generic solution 
that would work with every field - if the issue at hand is just the site: 
field, then we can use raw field and a RawQueryFilter to parse multiple terms 
within the SiteQueryFilter implementation, without changing the global query 
syntax.
The original intention was to allow this only in site queries; however, I 
cannot see a reason not to enable it for other fields. 








[jira] Commented: (NUTCH-573) Multiple Domains - Query Search

2007-11-14 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542449
 ] 

Enis Soztutar commented on NUTCH-573:
-

@Andrzej
I recall Google accepting the comma-delimited syntax, but it doesn't seem to 
work now, does it? Maybe I remembered wrong. 
http://www.google.com/intl/en/help/operators.html confirms that the 
comma-delimited syntax is not allowed, but we can make allintitle:-style 
queries. 

I think the raw fields, which are site, date, type, and lang, are unlikely to 
contain commas, so we may not have to worry about escape characters. As far as 
I know, we treat a comma as whitespace, so searching for comma-containing 
phrases in raw fields is not enabled anyway. Of course, we can fix this should 
it be needed. 

@Dogacan 
I share the same concerns about performance and complexity regarding NUTCH-479. 
However, it may be good if it were implemented correctly. 





[jira] Assigned: (NUTCH-573) Multiple Domains - Query Search

2007-11-13 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar reassigned NUTCH-573:
---

Assignee: Enis Soztutar





[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

2007-11-13 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-573:


Attachment: multiTermQuery_v1.patch

Here is a patch that enables querying multiple values for the same field. 
# The query syntax is changed to enable [field:]term1(,term2)* queries, where 
multiple terms are converted into a boolean OR query. 
# Query.Clause, Query.Term, and Query.Phrase are changed significantly. 

This is an initial version of the patch for review; I will test it a bit more 
today. 
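The syntax change can be illustrated with a small parser sketch (hypothetical code, not the patch itself; a real parser would also have to handle colons inside terms, such as http:// URLs):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of parsing "[field:]term1(,term2)*" into a field plus the terms
// that would be OR'ed together by the query filter.
public class MultiTermClause {
    final String field;       // "default" when no field prefix is given
    final List<String> terms; // OR'ed together when building the Lucene query

    MultiTermClause(String field, List<String> terms) {
        this.field = field;
        this.terms = terms;
    }

    static MultiTermClause parse(String clause) {
        String field = "default";
        String body = clause;
        int colon = clause.indexOf(':');
        if (colon > 0) {
            field = clause.substring(0, colon);
            body = clause.substring(colon + 1);
        }
        return new MultiTermClause(field, Arrays.asList(body.split(",")));
    }

    public static void main(String[] args) {
        MultiTermClause c = parse("site:www.aaa.com,www.bbb.com");
        System.out.println(c.field + " OR over " + c.terms);
    }
}
```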






[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

2007-11-09 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541326
 ] 

Enis Soztutar commented on NUTCH-574:
-

Why don't you just refactor the anchor-indexing code into another plugin, say 
index-anchor, enabled by default? Then all you need to do is not use that 
plugin, but only index-basic, right? That way we can avoid adding to the 
never-ending list of configuration parameters *smile*. 

bq. The current idea is to have three options. An always include, never 
include, and include if matches text on page. 
In another issue, we can add a new plugin called index-anchor-matching that 
does exactly that. Choosing from a list of plugins is the beauty of the plugin 
system, after all. 

 Including inlink anchor text in index can create irrelevant search results.
 ---

 Key: NUTCH-574
 URL: https://issues.apache.org/jira/browse/NUTCH-574
 Project: Nutch
  Issue Type: Bug
  Components: indexer
 Environment: All, basic indexing filter
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-574-1.patch


 Currently the basic indexing filter includes inbound anchor text for a given 
 URL in the index. This sometimes allows pages to show up in search results 
 where they may not be relevant. An example of this is a search for dallas 
 hotels in our production index (www.visvo.com). Google would show up first 
 in this example, although there is no text matching either dallas or hotels 
 on the Google home page. What is happening is that there are inlinks into 
 Google with the words dallas and hotels, which get included in the index for 
 google.com; and because Google has a very high boost due to inlinks, Google 
 shows up first for these search terms. I propose we add an option to 
 allow/prevent inlink anchor text from being included in the index, and set 
 the default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-574) Including inlink anchor text in index can create irrelevant search results.

2007-11-09 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541359
 ] 

Enis Soztutar commented on NUTCH-574:
-

Honestly, I don't think that refusing to index anchor words that do not appear 
in the web site text is a wise solution. What made Google so successful is 
indexing anchor text + PR, the classic example being that the page 
http://www.honda.com/ never mentions that Honda is a car manufacturer, but the 
anchor text does.   

That said, I think we should focus on finding a way to eliminate the noise on 
anchor text. At this point we take the first 10K links and discard the others, 
due to size constraints. But a better way would be to select the best ones, or 
select the most frequent words, etc. 




 Including inlink anchor text in index can create irrelevant search results.
 ---

 Key: NUTCH-574
 URL: https://issues.apache.org/jira/browse/NUTCH-574
 Project: Nutch
  Issue Type: Bug
  Components: indexer
 Environment: All, basic indexing filter
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-574-1.patch


 Currently the basic indexing filter includes inbound anchor text for a given 
 URL in the index.  This sometimes allows pages to show up in search results 
 where they may not be relevant.  An example of this is a search for dallas 
 hotels in our production index (www.visvo.com).  Google would show up first 
 in this example although there is no text matching either dallas or hotels on 
 the google home page.  What is happening here is there are inlinks into 
 google with the words dallas and hotels which get included in the index for 
 google.com and because google would have a very high boost due to inlinks, 
 google shows up first for these search terms.  I propose we add an option to 
 allow/prevent inlink anchor text from being included in the index and set the 
 default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-10-26 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12537954
 ] 

Enis Soztutar commented on NUTCH-442:
-

Due to the method signature bug 
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6267833) for 
{{ExecutorService#invokeAll}}, the patch will not compile against Java 1.5. We 
should manage the lists as List<Callable<T>>.  
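For reference, the workaround can be sketched with plain JDK types: declaring the task collection explicitly as List<Callable<T>> so that the 1.5 signature of invokeAll resolves. This is a self-contained illustration, not the actual patch code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InvokeAllSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Declaring the tasks as List<Callable<Integer>> (not a wildcard
        // collection) keeps invokeAll compilable against the Java 1.5
        // signature affected by bug 6267833.
        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (int i = 0; i < 3; i++) {
            final int n = i;
            tasks.add(new Callable<Integer>() {
                public Integer call() { return n * n; }
            });
        }
        int sum = 0;
        for (Future<Integer> f : pool.invokeAll(tasks)) {
            sum += f.get();
        }
        pool.shutdown();
        System.out.println(sum); // 0 + 1 + 4 = 5
    }
}
```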

 Integrate Solr/Nutch
 

 Key: NUTCH-442
 URL: https://issues.apache.org/jira/browse/NUTCH-442
 Project: Nutch
  Issue Type: New Feature
 Environment: Ubuntu linux
Reporter: rubdabadub
 Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, 
 schema.xml


 Hi:
 After trying out Sami's patch regarding Solr/Nutch. Can be found here 
 (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html)
  and I can confirm it worked :-) And that lead me to request the following :
 I would be very grateful if this could be included in Nutch 0.9, as I 
 am trying to eliminate my Python-based crawler which posts documents to Solr. 
 As I am in a corporate environment I can't install the trunk version in the 
 production environment, thus I am asking for this to be included in the 0.9 release. I 
 hope my wish would be granted.
 I look forward to get some feedback.
 Thank you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-10-15 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534869
 ] 

Enis Soztutar commented on NUTCH-442:
-

Using Nutch with Solr has been a frequently requested feature, so it will be 
very useful when this makes it into trunk. I have spent some time reviewing the 
patch, which I find quite elegant. 

Some improvements to the patch would be: 
- make NutchDocument implement VersionedWritable instead of Writable, and 
delegate version checking to the superclass
- refactor the getDetails() methods from HitDetailer into Searcher (it is not 
likely that a class would implement Searcher but not HitDetailer)
- use Searcher; delete HitDetailer and SearchBean 
- rename the XXXBean classes so that they do not include Bean (I think it is 
confusing to have bean objects that have non-trivial functionality)
- refactor LuceneSearchBean.VERSION to RPCSearchBean
- remove unrelated changes from the patch (the changes in NGramProfile, 
HTMLLanguageParser, LanguageIdentifier, ...; correct me if I'm wrong)

As far as I can see, we do not need any metadata for the Solr backend, and only 
need Store, Index and TermVector options for the Lucene backend, so I think we 
can simplify NutchDocument#metadata. We may implement:  
{code}
class FieldMeta {
  o.a.l.document.Field.Store store;
  o.a.l.document.Field.Index index;
  o.a.l.document.Field.TermVector tv;
}

FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
  ...
  private ArrayList<FieldMeta> fieldMeta;
  ...
}

{code}

Or alternatively we may wish to keep the add methods of NutchDocument 
compatible with o.a.l.document.Document, keeping the metadata up to date as we 
add new fields, using this info in LuceneWriter but ignoring it in SolrWriter. 
This will be slightly slower, but the API will be much more intuitive. 
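A rough pure-Java sketch of this second alternative — an add method that keeps the metadata up to date as fields are added — might look as follows. The Store/Index enums here are simplified stand-ins for the Lucene Field constants, and the class names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for o.a.l.document.Field.Store / Field.Index.
enum Store { YES, NO }
enum Index { TOKENIZED, UN_TOKENIZED, NO }

class FieldMeta {
    final String name;
    final Store store;
    final Index index;
    FieldMeta(String name, Store store, Index index) {
        this.name = name; this.store = store; this.index = index;
    }
}

// The add() signature mirrors the Lucene Document#add style, so callers
// stay compatible; a LuceneWriter would consume the metadata while a
// SolrWriter would ignore it.
public class NutchDocumentSketch {
    private final List<FieldMeta> fieldMeta = new ArrayList<FieldMeta>();

    public void add(String name, String value, Store store, Index index) {
        // Record the per-field metadata as the field is added.
        fieldMeta.add(new FieldMeta(name, store, index));
        // ... the value itself would be stored alongside ...
    }

    public int fieldCount() { return fieldMeta.size(); }

    public static void main(String[] args) {
        NutchDocumentSketch doc = new NutchDocumentSketch();
        doc.add("url", "http://example.org/", Store.YES, Index.TOKENIZED);
        doc.add("content", "hello", Store.NO, Index.TOKENIZED);
        System.out.println(doc.fieldCount());
    }
}
```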

 Integrate Solr/Nutch
 

 Key: NUTCH-442
 URL: https://issues.apache.org/jira/browse/NUTCH-442
 Project: Nutch
  Issue Type: New Feature
 Environment: Ubuntu linux
Reporter: rubdabadub
 Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, 
 schema.xml


 Hi:
 After trying out Sami's patch regarding Solr/Nutch. Can be found here 
 (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html)
  and I can confirm it worked :-) And that lead me to request the following :
 I would be very grateful if this could be included in Nutch 0.9, as I 
 am trying to eliminate my Python-based crawler which posts documents to Solr. 
 As I am in a corporate environment I can't install the trunk version in the 
 production environment, thus I am asking for this to be included in the 0.9 release. I 
 hope my wish would be granted.
 I look forward to get some feedback.
 Thank you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-08-20 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521033
 ] 

Enis Soztutar commented on NUTCH-439:
-

Recently Matt Cutts has written about the parts of URLs: 
http://www.mattcutts.com/blog/seo-glossary-url-definitions/

It seems that, as expected, Google deals with the different parts of URLs. 
*smile*

 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, 
 tld_plugin_v2.3.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-541) Index url field untokenized

2007-08-09 Thread Enis Soztutar (JIRA)
Index url field untokenized
---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.0.0


The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
untokenized version of the url field in some contexts: 
1. For deleting duplicates by url (at search time). See NUTCH-455.
2. For restricting the search to a certain url (may be used in the case of RSS 
search, where each entry in the RSS feed is added as a distinct document with 
(possibly) the same url). 
   query-url extends FieldQueryFilter, so: 
Query: url:http://www.apache.org/
Parsed: url:http http-www http-www-apache www www-apache apache org
Translated: +url:http-http-www http-www-http-www-apache 
http-www-apache-www www-www-apache www-apache apache org
3. For accessing a document (or documents) in the search servers 
(using a query plugin).

I suggest we add the url as in index-basic and implement a query-url-untoken 
plugin: 
doc.add(new Field("url", url.toString(), Field.Store.YES, 
Field.Index.TOKENIZED));
doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
Field.Index.UN_TOKENIZED));


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-27 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v2.3.patch

bq. TLDScoringFilter contains a misspelled field, tldEnties, it should be 
renamed to tldEntries
Done!
bq. one of the use cases for the tld index field that you mention is that 
users may search on it. But in the latest patch this field is added with 
Field.Index.NO, which makes searching on it impossible. Also, in order to 
search on arbitrary Lucene fields Nutch needs a Query filter, so we would need 
a TLDQueryFilter, which doesn't exist (yet?). 

Well, in fact NUTCH-445 covers searching on tlds; namely, we would be able to 
search site:lucene.apache.org, site:apache.org or even site:org, therefore I 
think indexing tld fields and a TLDQueryFilter are not needed. I will delve 
deeper into NUTCH-445 as soon as I find some time. We can move the domain 
indexing functionality to index-basic so that it will be generic enough. 

bq. using domain names instead of host names - we need to discuss this further, 
let's create a separate issue on this. 
We can open issues case by case, since the patch is expected to have major 
side effects. 

 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, 
 tld_plugin_v2.3.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-19 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513819
 ] 

Enis Soztutar commented on NUTCH-518:
-

Since there is no ordering among scoring filters, if we do something specific 
to zero in OpicScoring, it might lead to nondeterministic behaviour. Let's say, 
for example, the code in OpicScoring is changed so that: 

public float indexerScore(Text url, Document doc, CrawlDatum dbDatum, 
    CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
  if (initScore != 0)
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  else
    // do something nasty
}

Then there will be a big difference depending on whether scoring-opic is run 
before or after scoring-foo. 
As far as I can tell from the messages on the mailing lists, scoring filters 
are used for restricting the crawl to topics, so zero-handling might break 
topic-specific crawls. So my vote is to keep the current implementation. 

 Fix OpicScoringFilter to respect scoring filter chaining
 

 Key: NUTCH-518
 URL: https://issues.apache.org/jira/browse/NUTCH-518
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: opicScoring.chain.patch


 Opic Scoring returns the score that it calculates, rather than returning 
 previous_score * calculated_score. This prevents using another scoring filter 
 along with Opic scoring. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-19 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513826
 ] 

Enis Soztutar commented on NUTCH-518:
-

bq. I think removing initial score arguments and merging scores in 
ScoringFilters.java is a good idea overall
+1 for this one. The final score should be calculated centrally. Maybe we can 
implement more than one way to calculate the score. Roughly: 

ScoringFilters.getMultipliedScore()
ScoringFilters.getSummedScore() 
ScoringFilters.getGeometricMeanScore()
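A minimal sketch of what such centralized score combination could look like; plain Java, with hypothetical method names mirroring the list above, not the actual ScoringFilters API:

```java
// Hedged sketch: combine the scores contributed by each scoring filter
// centrally, instead of letting each filter fold in the previous score.
public class ScoreCombiner {
    // Product of all filter scores (the multiplicative chaining case).
    public static float multiplied(float[] scores) {
        float s = 1f;
        for (float v : scores) s *= v;
        return s;
    }

    // Sum of all filter scores.
    public static float summed(float[] scores) {
        float s = 0f;
        for (float v : scores) s += v;
        return s;
    }

    // Geometric mean: nth root of the product of n scores.
    public static float geometricMean(float[] scores) {
        double product = 1.0;
        for (float v : scores) product *= v;
        return (float) Math.pow(product, 1.0 / scores.length);
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 2.0f, 4.0f};
        System.out.println(multiplied(scores)); // 0.5 * 2 * 4 = 4.0
        System.out.println(summed(scores));     // 0.5 + 2 + 4 = 6.5
        System.out.println(geometricMean(scores)); // cube root of 4.0
    }
}
```

With a central combiner like this, per-filter behavior no longer depends on the (unordered) position of a filter in the chain.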


 Fix OpicScoringFilter to respect scoring filter chaining
 

 Key: NUTCH-518
 URL: https://issues.apache.org/jira/browse/NUTCH-518
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: opicScoring.chain.patch


 Opic Scoring returns the score that it calculates, rather than returning 
 previous_score * calculated_score. This prevents using another scoring filter 
 along with Opic scoring. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Enis Soztutar (JIRA)
Fix OpicScoringFilter to respect scoring filter chaining


 Key: NUTCH-518
 URL: https://issues.apache.org/jira/browse/NUTCH-518
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0


Opic Scoring returns the score that it calculates, rather than returning 
previous_score * calculated_score. This prevents using another scoring filter 
along with Opic scoring. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-517) build encoding should be UTF-8

2007-07-18 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-517:


Attachment: build.encoding.patch

Patch for UTF-8 is attached

 build encoding should be UTF-8
 --

 Key: NUTCH-517
 URL: https://issues.apache.org/jira/browse/NUTCH-517
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: build.encoding.patch


 build encoding send to javac should be UTF-8 so that non-ascii characters can 
 be used in the source code. This issue has emerged from NUTCH-439

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-18 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v2.2.patch

This patch includes core domain utilities and the tld plugin, but excludes 
the changes in NUTCH-517 and NUTCH-518. 

 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-10 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v2.0.patch

I have made major improvements to the code and configuration files. The issue 
now covers not just a plugin, but a package, one big xml file, and an 
indexing/scoring plugin (which is disabled by default). The list of recognized 
suffixes is no longer limited to top level domains; second- or third-level 
public domain names can be recognized as well. The patch also changes the 
naming from top level domains to domain suffixes. 

This patch also introduces the URLUtil class, which includes methods for 
getting the domain name, or public domain suffix, of a url. Finding the domain 
name of a url is quite important for several reasons. First, we can use this 
function as a replacement for URL.getHost() in LinkDB for ignoring internal 
links, or in similar contexts. Second, we can perform statistical analysis on 
domain names. Third, we can list subdomains under a domain, etc. 
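A simplified illustration of suffix-based domain extraction: a tiny hard-coded suffix set stands in here for the domain-suffixes.xml data, and the class itself is hypothetical, not the actual URLUtil code:

```java
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DomainSketch {
    // Tiny hard-coded stand-in for the domain-suffixes.xml entries.
    private static final Set<String> SUFFIXES = new HashSet<String>(
        Arrays.asList("org", "com", "co.uk"));

    // Walk the host labels left to right; the domain is the shortest
    // host suffix that is one label longer than a known public suffix.
    public static String getDomainName(String url) throws Exception {
        String host = new URL(url).getHost();
        String[] labels = host.split("\\.");
        for (int i = 0; i < labels.length; i++) {
            String candidate = join(labels, i);
            String parent = join(labels, i + 1);
            if (!SUFFIXES.contains(candidate) && SUFFIXES.contains(parent)) {
                return candidate;
            }
        }
        return host; // no known suffix matched; fall back to the host
    }

    private static String join(String[] labels, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < labels.length; i++) {
            if (sb.length() > 0) sb.append('.');
            sb.append(labels[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hostname is lucene.apache.org, but the domain is apache.org.
        System.out.println(getDomainName("http://lucene.apache.org/nutch"));
    }
}
```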

I have changed the build.encoding to UTF-8 so that non-ascii characters are 
recognized. 

here is an excerpt from the domain-suffixes.xml file: 
   This document contains top level domains 
as described by the Internet Assigned Numbers 
Authority (IANA), and second or third level domains that 
are known to be managed by domain registrars. People in 
the Mozilla community call these public suffixes or effective 
tlds. There is no algorithmic way of knowing whether a suffix 
is a public domain suffix or not, so this large file is used 
for this purpose. The entries in the file are used to find the 
domain of a url, which may not be the same thing as the host of 
the url. For example, for http://lucene.apache.org/nutch the 
hostname is lucene.apache.org, but the domain name for this 
url would be apache.org. Domain names can be quite handy for 
statistical analysis, and for fighting against spam.

The list of TLDs is constructed from IANA, and the 
list of effective tlds is constructed from Wikipedia, 
http://wiki.mozilla.org/TLD_List, and http://publicsuffix.org/
The list may not include all the suffixes, but some 
effort has been spent to make it comprehensive. Please forward 
any improvements to this list to the nutch-dev mailing list, or 
to the Nutch JIRA. 




 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-10 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: domain.suffixes_v2.1.patch

 Very nice patch! 
Thanks !
 IP_PATTERN - it could be tighter, instead of \\d+ it could use \\d{1,3}
now it is (\\d{1,3}\\.){3}(\\d{1,3})

the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The 
reason is that it's a common request for enhancement, but the specific 
requirements vary wildly. Some users prefer to build a separate DB that holds 
statistical info and can be used in various steps of the work cycle, while 
others prefer one-time tools such as this one.

DomainStatistics is really a quick hack I've written to demonstrate the 
new patch. I've removed it from the latest patch. Once the user requirements 
are settled, we can move on from there. 

Also, you may not want to commit MozillaPublicSuffixListParser.java, but it is 
good that we have it somewhere public. 


 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: domain.suffixes_v2.1.patch, tld_plugin_v1.0.patch, 
 tld_plugin_v1.1.patch, tld_plugin_v2.0.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-10 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: (was: domain.suffixes_v2.1.patch)

 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-10 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v2.1.patch

Oops, it seems that i've uploaded the wrong file. This is the correct one. 

 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
 tld_plugin_v2.0.patch, tld_plugin_v2.1.patch


 Top Level Domains (tlds) are the last part(s) of the host name in a DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority. IANA 
 divides tlds into three. infrastructure, generic(such as com, edu) and 
 country code tlds(such as en, de , tr, ). Indexing the top level domain 
 and optionally boosting is needed for improving the search results and 
 enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-510) IndexMerger delete working dir

2007-07-09 Thread Enis Soztutar (JIRA)
IndexMerger delete working dir
--

 Key: NUTCH-510
 URL: https://issues.apache.org/jira/browse/NUTCH-510
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0


IndexMerger does not delete the working dir when an IOException is thrown such 
as No space left on device. Local temporary directories should be deleted. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-510) IndexMerger delete working dir

2007-07-09 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-510:


Attachment: index.merger.delete.temp.dirs.patch

Attached patch deletes the working dirs in a finally clause and eliminates the 
Java 5.0 warnings in IndexMerger. 
A FileAlreadyExistsException is thrown if the output index directory already 
exists, similar to OutputFormatBase#checkOutputSpecs(). 

 IndexMerger delete working dir
 --

 Key: NUTCH-510
 URL: https://issues.apache.org/jira/browse/NUTCH-510
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: index.merger.delete.temp.dirs.patch


 IndexMerger does not delete the working dir when an IOException is thrown 
 such as No space left on device. Local temporary directories should be 
 deleted. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-510) IndexMerger delete working dir

2007-07-09 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511043
 ] 

Enis Soztutar edited comment on NUTCH-510 at 7/9/07 5:32 AM:
-

Attached patch deletes the working dirs in a finally clause and eliminates the 
Java 5.0 warnings in IndexMerger. 
A FileAlreadyExistsException is thrown if the output index directory already 
exists, similar to OutputFormatBase#checkOutputSpecs(). 


 was:
Attached patch deletes working dirs on finally clause, eliminates java 5.0 
warnings and in IndexMerger. 
An FileAlreadyExistsException is thrown if the output index directory already 
exists, which is similar to OutputFormatBase#chechOutputSpecs(); 

 IndexMerger delete working dir
 --

 Key: NUTCH-510
 URL: https://issues.apache.org/jira/browse/NUTCH-510
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: index.merger.delete.temp.dirs.patch


 IndexMerger does not delete the working dir when an IOException is thrown 
 such as No space left on device. Local temporary directories should be 
 deleted. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-508) ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker

2007-07-09 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511121
 ] 

Enis Soztutar commented on NUTCH-508:
-

The TaskTracker spawns another JVM running TaskTracker$Child, but 
hadoop.log.dir is not passed as a parameter. The logs of the user program are 
handled correctly, but I suppose the problem is that logs in TaskTracker$Child 
itself, such as 
LOG.debug("Child starting"); 
cannot be written. I suppose you can ask about this on hadoop-dev to see if it 
is indeed a Hadoop issue. 

 ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker
 --

 Key: NUTCH-508
 URL: https://issues.apache.org/jira/browse/NUTCH-508
 Project: Nutch
  Issue Type: Bug
 Environment: Linux 2.6, Java1.6
Reporter: Emmanuel Joke
 Fix For: 1.0.0


 As described in http://www.nabble.com/Crawl-error-with-hadoop-t3994217.html
 the log4j config file is missing some parameters.
 hadoop.log.dir=.
 hadoop.log.file=hadoop.log
 Thanks for the help of Mathijs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-27 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-471:


Attachment: NutchBeanCreationSync_v2.patch

From http://www-128.ibm.com/developerworks/java/library/j-dcl.html

The bottom line is that double-checked locking, in whatever form, should not be 
used because you cannot guarantee that it will work on any JVM implementation. 
JSR-133 is addressing issues regarding the memory model, however, 
double-checked locking will not be supported by the new memory model. 
Therefore, you have two options:
* Accept the synchronization of a getInstance() method as shown in Listing 
2.
* Forgo synchronization and use a static field.

Since we don't want to compromise the performance of NutchBean.get(), 
synchronization is not a solution. Thus, as Sami has suggested, I have written 
a ServletContextListener and added the NutchBean construction code there, and 
modified web.xml to register the event listener class. Also, in the servlet 
initialization, the Configuration object is initialized and cached by 
NutchConfiguration, so we avoid the same problem in NutchConfiguration.get(). 

I have tested the implementation and it seems OK. 
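The "static field" option from the article is commonly realized with the initialization-on-demand holder idiom; a generic sketch of that idiom follows, not the NutchBean code itself:

```java
// Sketch of the "static field" alternative to double-checked locking:
// the JVM guarantees the holder class is initialized exactly once, on
// first access, without any explicit synchronization in get().
public class SingletonSketch {
    private static int constructions = 0;

    private SingletonSketch() { constructions++; }

    private static class Holder {
        // Class initialization is thread-safe per the JLS, so this
        // field is assigned exactly once.
        static final SingletonSketch INSTANCE = new SingletonSketch();
    }

    public static SingletonSketch get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        SingletonSketch a = get();
        SingletonSketch b = get();
        System.out.println(a == b);        // true: same instance
        System.out.println(constructions); // 1: constructed once
    }
}
```

A ServletContextListener achieves the same effect for a web application by constructing the bean eagerly at context startup, before any request thread can race on it.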


 Fix synchronization in NutchBean creation
 -

 Key: NUTCH-471
 URL: https://issues.apache.org/jira/browse/NUTCH-471
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: NutchBeanCreationSync_v1.patch, 
 NutchBeanCreationSync_v2.patch


 NutchBean is created and then cached in servlet context. But 
 NutchBean.get(ServletContext app, Configuration conf) is not synchronized, 
 which causes more than one instance of the bean (and 
 DistributedSearch$Client) if servlet container is accessed rapidly during 
 startup. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-475) Adaptive crawl delay

2007-04-25 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491882
 ] 

Enis Soztutar commented on NUTCH-475:
-

we can use a formula like:

delay = alpha * delay + (1 - alpha) * (k * t)

where 0 < alpha <= 1

so that the waiting time is less sensitive to varying reply times of the 
server. 
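As a sketch, the exponentially weighted update above could look like this (class and field names are hypothetical, not fetcher code):

```java
// Sketch of the adaptive delay formula: delay = alpha*delay + (1-alpha)*(k*t),
// with 0 < alpha <= 1. Larger alpha means the delay reacts more slowly to
// changes in the server's reply time t. All names here are illustrative.
public class AdaptiveDelay {
    private double delay;        // current per-host delay (ms)
    private final double alpha;  // smoothing factor, 0 < alpha <= 1
    private final double k;      // multiplier on the observed response time

    public AdaptiveDelay(double initialDelay, double alpha, double k) {
        this.delay = initialDelay;
        this.alpha = alpha;
        this.k = k;
    }

    // update the delay after observing a server response time t (ms)
    public double update(double t) {
        delay = alpha * delay + (1 - alpha) * (k * t);
        return delay;
    }
}
```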


 Adaptive crawl delay
 

 Key: NUTCH-475
 URL: https://issues.apache.org/jira/browse/NUTCH-475
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Doğacan Güney
 Fix For: 1.0.0

 Attachments: adaptive-delay_draft.patch


 Current fetcher implementation waits a default interval before making another 
 request to the same server (if crawl-delay is not specified in robots.txt). 
 IMHO, an adaptive implementation will be better. If the server is under 
 little load and can server requests fast, then fetcher can ask for more pages 
 in a given interval. Similarly, if the server is suffering from heavy load, 
 fetcher can slow down(w.r.t that host), easing the load on the server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Enis Soztutar (JIRA)
Fix synchronization in NutchBean creation
-

 Key: NUTCH-471
 URL: https://issues.apache.org/jira/browse/NUTCH-471
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0


NutchBean is created and then cached in servlet context. But 
NutchBean.get(ServletContext app, Configuration conf) is not synchronized, which 
causes more than one instance of the bean (and DistributedSearch$Client) if 
servlet container is accessed rapidly during startup. 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-471:


Attachment: NutchBeanCreationSync_v1.patch

this patch synchronizes NutchBean.get(ServletContext app, Configuration conf) 
using the servlet context as the mutex. (NutchBean) app.getAttribute("nutchBean") 
is checked twice; the first check is not synchronized for performance reasons. 
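A minimal sketch of the double-check described here, with a plain holder object standing in for the ServletContext (all names illustrative):

```java
// Sketch of the double-checked pattern in this patch: an unsynchronized
// fast-path read, then a synchronized block that re-checks before creating.
// A static field stands in for the ServletContext attribute; note that
// without volatile/final publication this pattern is unsafe under the
// pre-JSR-133 memory model, as the follow-up comments on this issue discuss.
public class BeanHolder {
    private static Object bean;                          // the cached "NutchBean"
    private static final Object context = new Object();  // stands in for ServletContext

    static Object get() {
        if (bean == null) {                  // first, unsynchronized check
            synchronized (context) {
                if (bean == null) {          // second check under the lock
                    bean = new Object();     // placeholder for new NutchBean(conf)
                }
            }
        }
        return bean;
    }
}
```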

 Fix synchronization in NutchBean creation
 -

 Key: NUTCH-471
 URL: https://issues.apache.org/jira/browse/NUTCH-471
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: NutchBeanCreationSync_v1.patch


 NutchBean is created and then cached in servlet context. But 
 NutchBean.get(ServletContext app, Configuration conf) is not synchronized, 
 which causes more than one instance of the bean (and 
 DistributedSearch$Client) if servlet container is accessed rapidly during 
 startup. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491313
 ] 

Enis Soztutar commented on NUTCH-471:
-

 Nice trick with the unsynchronized check. :)
Wow, indeed i have used a pattern w/o knowing about it :) It seemed a simple and 
efficient solution to me.

Isn't the DCL declared to be broken? 
After reading http://en.wikipedia.org/wiki/Double-checked_locking, i can say 
that this is a very subtle bug. As suggested, we can fix it by declaring the 
NutchBean field volatile. However i guess that in that case the servlet 
container would also have to be configured to use Java 1.5 instead of 1.4. 
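For reference, a sketch of the volatile variant mentioned above, which is safe under the Java 5 (JSR-133) memory model (names illustrative):

```java
// Sketch of the volatile fix: declaring the field volatile makes the
// double-checked locking idiom safe on Java 1.5+, because the write to the
// field happens-before any subsequent read of it. Names are illustrative.
public class VolatileHolder {
    private static volatile Object bean;

    static Object get() {
        Object b = bean;                      // one volatile read on the fast path
        if (b == null) {
            synchronized (VolatileHolder.class) {
                b = bean;                     // re-read under the lock
                if (b == null) {
                    bean = b = new Object();  // placeholder for the NutchBean
                }
            }
        }
        return b;
    }
}
```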



 Fix synchronization in NutchBean creation
 -

 Key: NUTCH-471
 URL: https://issues.apache.org/jira/browse/NUTCH-471
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
 Fix For: 1.0.0

 Attachments: NutchBeanCreationSync_v1.patch


 NutchBean is created and then cached in servlet context. But 
 NutchBean.get(ServletContext app, Configuration conf) is not synchronized, 
 which causes more than one instance of the bean (and 
 DistributedSearch$Client) if servlet container is accessed rapidly during 
 startup. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-466) Flexible segment format

2007-04-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485977
 ] 

Enis Soztutar commented on NUTCH-466:
-

This patch will indeed resolve many issues related to storing extra information 
about the crawl. IMO MapFiles will do the job. 
The Searcher API can be extended with an interface with a method like 

  <T extends Writable, E extends Writable> E getInfo(T key); 

The implementing class should have a map of Class to MapFiles. 
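A rough sketch of the class-keyed lookup suggested here, with a path string standing in for a MapFile (all names hypothetical; the follow-up comment below refines this to name-based selection):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the dispatch idea: a map from the value class to the segment
// part (MapFile) that stores values of that class. A plain path string
// stands in for the MapFile itself; all names here are illustrative.
public class SegmentInfoRegistry {
    private final Map<Class<?>, String> parts = new HashMap<Class<?>, String>();

    void register(Class<?> valueClass, String mapFilePath) {
        parts.put(valueClass, mapFilePath);
    }

    // a getInfo-style call would use the requested value class to pick
    // the MapFile to read from
    String partFor(Class<?> valueClass) {
        return parts.get(valueClass);
    }
}
```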

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
 Assigned To: Andrzej Bialecki 

 In many situations it is necessary to store more data associated with pages 
 than is possible now with the current segment format. Quite often it's 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData, the other is to use an 
 external independent database using page ID-s as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-466) Flexible segment format

2007-04-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485996
 ] 

Enis Soztutar commented on NUTCH-466:
-

 There may be many parts that use the same key/value classes in MapFiles.

Yes indeed you are right. I haven't thought about several parts having the same 
classes. 

 I think the API should select the part by name (String) or some other ID, 
 with a map of byte ID-s to directory names

I thought that the map will be from class names to directory names. 

I think we should use the plugin model, with a registry of segment parts that 
are active for the current configuration

Do you think that we should also move HitDetailer, HitSummarizer, HitContent and 
Searcher to this plugin system? And should we break up the multiple functionality 
in NutchBean and DistributedSearch$Client, and allow for separate index and 
segment servers? 

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
 Assigned To: Andrzej Bialecki 

 In many situations it is necessary to store more data associated with pages 
 than is possible now with the current segment format. Quite often it's 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData, the other is to use an 
 external independent database using page ID-s as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-08 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479262
 ] 

Enis Soztutar commented on NUTCH-455:
-

(from LUCENE-252)

In nutch we have 3 options: the 1st is to disallow deleting duplicates on 
tokenized fields (due to FieldCache), the 2nd is to index the tokenized field 
twice (once tokenized, and once untokenized), the 3rd is to use LUCENE-252 and 
the above patch and warm the cache initially in the index servers.

I am in favor of the 3rd option. 
I think first resolving LUCENE-252, and then proceeding with NUTCH-455, is more 
sensible. 

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Fix For: 0.9.0

 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values from 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However for a tokenized field (for 
 example url in nutch), FieldCacheImpl returns an array of Terms rather than 
 an array of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields, and as described above 
 caches only terms. 
 So in the situation where we are searching using url as the dedup field, 
 when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of 
 the url (such as www or com) rather than the whole url. This prevents 
 using tokenized fields as the dedup field. 
 I have written a patch for lucene and attached it in 
 http://issues.apache.org/jira/browse/LUCENE-252, this patch fixes the 
 aforementioned issue about tokenized field caching. However building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and for the first call of search in IndexSearcher, cache is built. 
 Long story short, i have written a patch against IndexSearcher, which in its 
 constructor warms up the caches of wanted fields (configurable). I think we 
 should vote for LUCENE-252, and then commit the above patch with the latest 
 version of lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Enis Soztutar (JIRA)
dedup on tokenized fields is faulty
---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Fix For: 0.9.0


(From LUCENE-252) 
nutch uses several index servers, and the search results from these servers are 
merged using a dedup field for deleting duplicates. The values from this 
field are cached by Lucene's FieldCacheImpl. The default is the site field, which 
is indexed and tokenized. However for a tokenized field (for example url in 
nutch), FieldCacheImpl returns an array of Terms rather than an array of field 
values, so dedup'ing becomes faulty. The current FieldCache implementation does 
not respect tokenized fields, and as described above caches only terms. 

So in the situation where we are searching using url as the dedup field, when 
a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the 
url (such as www or com) rather than the whole url. This prevents using 
tokenized fields as the dedup field. 

I have written a patch for lucene and attached it in 
http://issues.apache.org/jira/browse/LUCENE-252, this patch fixes the 
aforementioned issue about tokenized field caching. However building such a 
cache for about 1.5M documents takes 20+ secs. The code in 
IndexSearcher.translateHits() starts with

if (dedupField != null) 
  dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);

and for the first call of search in IndexSearcher, cache is built. 

Long story short, i have written a patch against IndexSearcher, which in its 
constructor warms up the caches of wanted fields (configurable). I think we 
should vote for LUCENE-252, and then commit the above patch with the latest 
version of lucene.
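A minimal sketch of the constructor warm-up idea, with a plain map standing in for Lucene's FieldCache (all names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of warming the per-field value cache eagerly when the searcher is
// created, so the first query does not pay the 20+ second cache-building
// cost. Lucene's FieldCache.DEFAULT.getStrings(reader, field) is stood in
// for by a synthetic loader; all names here are illustrative.
public class WarmingSearcher {
    private final Map<String, String[]> fieldCache = new HashMap<String, String[]>();

    public WarmingSearcher(String[] fieldsToWarm, int numDocs) {
        for (String field : fieldsToWarm) {
            fieldCache.put(field, loadFieldValues(field, numDocs)); // eager build
        }
    }

    // stands in for building the field cache from the index reader
    private String[] loadFieldValues(String field, int numDocs) {
        String[] values = new String[numDocs];
        for (int i = 0; i < numDocs; i++) values[i] = field + "-" + i;
        return values;
    }

    // query time: the cache is already warm
    String[] getStrings(String field) {
        return fieldCache.get(field);
    }
}
```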






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-28 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.2.patch

This patch is an update of the previous three patches. 
The patch: 
1. contains TranslatingRawFieldQueryFilter as an abstract implementation for 
searching certain fields in the index with a different query fieldname. 
2. index-basic indexes the domain and all super domains in the domain field.
3. query-site is changed so that site:site_name will search domain:site_name

With this plugin we can search site:apache.org and get results from 
http://issues.apache.org, etc., or we can search site:com to retrieve all .com 
domains. 

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: index_query_domain_v1.0.patch, 
 index_query_domain_v1.1.patch, index_query_domain_v1.2.patch, 
 TranslatingRawFieldQueryFilter_v1.0.patch


 Hostnames contain information about the domain of the host, and all of the 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From the DomainIndexingFilter javadoc: 
 Adds the domain (hostname) and all super domains to the index. 
  * <br> For http://lucene.apache.org/nutch/ the 
  * following will be added to the index: <br> 
  * <ul>
  * <li>lucene.apache.org</li>
  * <li>apache.org</li>
  * <li>org</li>
  * </ul>
  * All hostnames are domain names, but not all domain names are 
  * hostnames. In the above example the hostname lucene is a 
  * subdomain of apache.org, which is itself a subdomain of 
  * org. <br>
  * 
  
 Currently the basic indexing filter indexes the hostname in the site field, and 
 the query-site plugin 
 allows searching in the site field. However site:apache.org will not return 
 http://lucene.apache.org
  By indexing the domain, we are able to search domains. Unlike 
  the site field (indexed by BasicIndexingFilter) search, searching the 
  domain field allows us to retrieve lucene.apache.org for the query 
  apache.org. 
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)
Domain İndexing / Query Filter
--

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar


Hostnames contain information about the domain of the host, and all of the 
subdomains. Indexing and searching the domains is important for intuitive 
behavior. 

From the DomainIndexingFilter javadoc: 
Adds the domain (hostname) and all super domains to the index. 
 * <br> For http://lucene.apache.org/nutch/ the 
 * following will be added to the index: <br> 
 * <ul>
 * <li>lucene.apache.org</li>
 * <li>apache.org</li>
 * <li>org</li>
 * </ul>
 * All hostnames are domain names, but not all domain names are 
 * hostnames. In the above example the hostname lucene is a 
 * subdomain of apache.org, which is itself a subdomain of 
 * org. <br>
 * 
 
Currently the basic indexing filter indexes the hostname in the site field, and 
the query-site plugin 
allows searching in the site field. However site:apache.org will not return 
http://lucene.apache.org

 By indexing the domain, we are able to search domains. Unlike 
 the site field (indexed by BasicIndexingFilter) search, searching the 
 domain field allows us to retrieve lucene.apache.org for the query 
 apache.org. 
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.0.patch

Patch for index-domain and query-domain plugins. 

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: index_query_domain_v1.0.patch


 Hostnames contain information about the domain of the host, and all of the 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From the DomainIndexingFilter javadoc: 
 Adds the domain (hostname) and all super domains to the index. 
  * <br> For http://lucene.apache.org/nutch/ the 
  * following will be added to the index: <br> 
  * <ul>
  * <li>lucene.apache.org</li>
  * <li>apache.org</li>
  * <li>org</li>
  * </ul>
  * All hostnames are domain names, but not all domain names are 
  * hostnames. In the above example the hostname lucene is a 
  * subdomain of apache.org, which is itself a subdomain of 
  * org. <br>
  * 
  
 Currently the basic indexing filter indexes the hostname in the site field, and 
 the query-site plugin 
 allows searching in the site field. However site:apache.org will not return 
 http://lucene.apache.org
  By indexing the domain, we are able to search domains. Unlike 
  the site field (indexed by BasicIndexingFilter) search, searching the 
  domain field allows us to retrieve lucene.apache.org for the query 
  apache.org. 
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: TranslatingRawFieldQueryFilter_v1.0.patch

This patch complements index_query_domain_v1.0.patch. 

However, the class TranslatingRawFieldQueryFilter can be used independently, so 
i have put it in a separate file. The javadoc reads: 

 * Similar to {@link RawFieldQueryFilter} except that the index 
 * and query field names can be different. 
 * <br>
 * This class can be extended by <code>QueryFilter</code>s to allow 
 * searching a field in the index, but using another field name in the 
 * search. 
 * <br>
 * For example index field names can be kept in English such as content, 
 * lang, title, ..., however query filters can be built in other languages

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: index_query_domain_v1.0.patch, 
 TranslatingRawFieldQueryFilter_v1.0.patch


 Hostnames contain information about the domain of the host, and all of the 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From the DomainIndexingFilter javadoc: 
 Adds the domain (hostname) and all super domains to the index. 
  * <br> For http://lucene.apache.org/nutch/ the 
  * following will be added to the index: <br> 
  * <ul>
  * <li>lucene.apache.org</li>
  * <li>apache.org</li>
  * <li>org</li>
  * </ul>
  * All hostnames are domain names, but not all domain names are 
  * hostnames. In the above example the hostname lucene is a 
  * subdomain of apache.org, which is itself a subdomain of 
  * org. <br>
  * 
  
 Currently the basic indexing filter indexes the hostname in the site field, and 
 the query-site plugin 
 allows searching in the site field. However site:apache.org will not return 
 http://lucene.apache.org
  By indexing the domain, we are able to search domains. Unlike 
  the site field (indexed by BasicIndexingFilter) search, searching the 
  domain field allows us to retrieve lucene.apache.org for the query 
  apache.org. 
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-02-07 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v1.1.patch

I have accidentally forgotten to unset http.agent.name in v1.0. This version is 
the same except that the agent name is not set. This patch obsoletes v1.0. 


 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch


 Top Level Domains (TLDs) are the last part(s) of the host name in the DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority (IANA), 
 which divides TLDs into three groups: infrastructure, generic (such as com, 
 edu) and country code TLDs (such as en, de, tr). Indexing the top level 
 domain and optionally boosting it is needed for improving the search results 
 and enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-02-06 Thread Enis Soztutar (JIRA)
Top Level Domains Indexing / Scoring


 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar


Top Level Domains (TLDs) are the last part(s) of the host name in the DNS system. 
TLDs are managed by the Internet Assigned Numbers Authority (IANA), which divides 
TLDs into three groups: infrastructure, generic (such as com, edu) and country 
code TLDs (such as en, de, tr). Indexing the top level domain and optionally 
boosting it is needed for improving the search results and enhancing locality. 




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-02-06 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v1.0.patch

This is a plugin implementation for indexing and scoring top level domains in 
nutch. TLDs are stored in the TLDEntry class, which has domain, status and 
boost fields. The TLDs are read from an XML file. There is also an XSD for 
validation. 

TLDIndexingFilter implements IndexingFilter interface to index the domain 
extensions (such as net, org, en, de) in the tld field. 

TLDScoringFilter implements the ScoringFilter interface. Basically this filter 
multiplies the initial boost (coming from another scoring filter such as OPIC) 
by the boost of the domain. This way, by configuring the boost of, say, edu 
domains to 1.1, the boost of educational sites' documents in the index is 
multiplied by 1.1. Also, local search engines may wish to boost the domains 
hosted in that country; for example, boosting de domains a little in a German 
SE seems reasonable. An alternative usage may be to lower the boosts of domains 
such as biz or info, which are known to have lots of spam. 

The users can also query the tld field for advanced search. 

Implementation notes: 1. OpicScoringFilter is changed to respect ScoringFilter 
chaining. 
2. Some of the second level domains, 
such as co.uk, are not recognized, but edu.uk is recognized.
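The boost multiplication described above can be sketched like this (names and values are illustrative, not the actual plugin API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the TLD scoring idea: multiply the boost coming from the
// previous scoring filter in the chain (e.g. OPIC) by a per-TLD boost
// read from configuration. All names and values here are illustrative.
public class TldBoost {
    private final Map<String, Float> tldBoosts = new HashMap<String, Float>();

    void setBoost(String tld, float boost) {
        tldBoosts.put(tld, boost);
    }

    // respects chaining: the incoming boost is scaled, not replaced
    float score(float initialBoost, String tld) {
        Float b = tldBoosts.get(tld);
        return b == null ? initialBoost : initialBoost * b;
    }
}
```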




 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch


 Top Level Domains (TLDs) are the last part(s) of the host name in the DNS 
 system. TLDs are managed by the Internet Assigned Numbers Authority (IANA), 
 which divides TLDs into three groups: infrastructure, generic (such as com, 
 edu) and country code TLDs (such as en, de, tr). Indexing the top level 
 domain and optionally boosting it is needed for improving the search results 
 and enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-251) Administration GUI

2006-11-23 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-251?page=all ]

Enis Soztutar updated NUTCH-251:


Attachment: Nutch-251-AdminGUI.tar.gz

I have updated the patch written by Stephan.
This version works with Nutch-0.9-dev and hadoop-0.7.1 (the current version of 
nutch so far)

First extract the tar.gz file into the root of nutch. It should copy 
src/plugin/admin-* 
lib/xalan.jar, lib/serializer.jar and lib/hadoop-0.7.2-dev.jar
hadoop_0.7.1_nutch_gui_v2.patch
nutch_0.9-dev_gui_v2.patch

then patch nutch with 
  patch -p0 < nutch_0.9-dev_gui_v2.patch 
  (you can test the patch first by running: patch -p0 --dry-run < 
nutch_0.9-dev_gui_v2.patch)

The patched hadoop is included in the archive, but if you wish you can patch 
hadoop using 
   patch -p0 < hadoop_0.7.1_nutch_gui_v2.patch


I have: 
converted the necessary java.io.File fields and arguments to 
org.apache.hadoop.fs.Path
replaced the deprecated LogFormatters with LogFactory
used generics with collections (changed only what I've seen)
written PathSerializable, which implements the Serializable interface (needed 
for scheduling)
made some hadoop changes and some changes due to hadoop conflicts. 

I have not tested every feature of this plugin, so there can still be some 
bugs. 

 Administration GUI
 --

 Key: NUTCH-251
 URL: http://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.9.0

 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web based administration interface would help to make nutch 
 administration and management much more user friendly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-11-16 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Enis Soztutar updated NUTCH-289:


Attachment: ipInCrawlDatumDraftV5.1.patch

The version 5 patch does not apply to the current build, so i have fixed it and 
resent the patch (did not change any code). I think this patch should be 
included in the trunk. 

 CrawlDatum should store IP address
 --

 Key: NUTCH-289
 URL: http://issues.apache.org/jira/browse/NUTCH-289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cutting
 Attachments: ipInCrawlDatumDraftV1.patch, 
 ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, 
 ipInCrawlDatumDraftV5.patch


 If the CrawlDatum stored the IP address of the host of it's URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-11-07 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Attachment: urlTokenizer-improved.diff

This is an improvement and a minor bug fix over the previous url tokenizer. 
This version first replaces characters which are represented in hexadecimal 
format in the urls. 

For example the url file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html will 
first be converted to file:///tmp/foo baz bar/foo/baz~bar/index.html by 
replacing the %20 escapes with spaces. 

A NullPointerException is corrected in the case of the input reader returning 
null for the url. 
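For illustration, the hexadecimal replacement step can be done with the standard URLDecoder (this is a sketch, not the actual tokenizer code):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Sketch of the percent-decoding step: %20 and other hex-escaped characters
// are replaced before tokenization. Java's standard URLDecoder performs this
// replacement (note it also turns '+' into a space, which raw URL paths do
// not normally contain). Class and method names here are illustrative.
public class UrlUnescape {
    static String decode(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            return url; // UTF-8 is always available; keep the input on failure
        }
    }
}
```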

Further improvements on the url tokenization can be discussed here. 


 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer-improved.diff, urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
 non-token separators, which is not appropriate in the case of URLs. So I have 
 written a URL tokenizer that emits the tokens matching the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
 url, site and host fields.
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html





[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2006-11-07 Thread Enis Soztutar (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447787 ] 

Enis Soztutar commented on NUTCH-393:
-

Also, IndexingException is caught by the Indexer, in which case the whole 
document is not added to the writer (the function returns).

Indexer.java, line 334:
try {
  // run indexing filters
  doc = this.filters.filter(doc, parse, (UTF8)key, fetchDatum, inlinks);
} catch (IndexingException e) {
  if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
  return;
}

IndexingException should instead be caught in IndexingFilters.filter(), so that 
when an IndexingException is thrown in one indexing plugin, the other plugins 
can still run. 
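The fix proposed here — catching the exception inside the filter chain so one failing plugin does not abort the rest — might look like this minimal sketch (all interfaces are simplified stand-ins, not the actual Nutch 0.8-era API):

```java
import java.util.ArrayList;
import java.util.List;

interface Doc { }

interface IndexingFilter {
    // Stand-in for IndexingException: any checked exception may be thrown.
    Doc filter(Doc doc) throws Exception;
}

class FilterChain {
    private final List<IndexingFilter> filters = new ArrayList<>();

    void add(IndexingFilter f) { filters.add(f); }

    // Run every filter; log and skip the ones that throw, instead of
    // aborting the whole document as Indexer currently does.
    Doc filterAll(Doc doc) {
        for (IndexingFilter f : filters) {
            if (doc == null) break; // a filter may veto the document
            try {
                doc = f.filter(doc);
            } catch (Exception e) {
                System.err.println("Error in filter, skipping: " + e);
                // keep the current doc and continue with the next filter
            }
        }
        return doc;
    }
}
```

With this structure a broken plugin degrades to a warning while the remaining filters still contribute their fields.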



 Indexer doesn't handle null documents returned by filters
 -

 Key: NUTCH-393
 URL: http://issues.apache.org/jira/browse/NUTCH-393
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.8.1
Reporter: Eelco Lempsink
 Attachments: NUTCH-393.patch


 Plugins (like IndexingFilter) may return a null value, but this isn't handled 
 by the Indexer.  A trivial adjustment is all it takes:
 @@ -237,6 +237,7 @@
if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
return;
  }
 +if (doc == null) return;
  
  float boost = 1.0f;
  // run scoring filters





[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-30 Thread Enis Soztutar (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12445512 ] 

Enis Soztutar commented on NUTCH-389:
-

Otis, you can test the tokenizer using the TestUrlTokenizer JUnit test case, 
and you can test the NutchDocumentTokenizer by running NutchDocumentTokenizer's 
main method. 

NutchDocumentTokenizer tokenizes http://www.foo_bar.com/baz_bar?cardar_mar as 

http www foo_bar com baz_bar cardar_mar


whereas urlTokenizer tokenizes the above URL as

http www foo bar com baz bar car dar mar

so it will also match the queries baz, bar, car, dar and mar.

For the URL 
http://www.google.com.tr/firefox?client=firefox-a&rls=org.mozilla:en-US:official

NutchDocumentTokenizer gives the tokens: http www google com tr firefox client 
firefox arls org mozilla en us official
urlTokenizer gives the tokens: http www google com tr firefox client firefox a 
rls org mozilla en US official 
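The urlTokenizer behaviour shown for the second URL can be reproduced with a plain split on runs of non-alphanumeric characters (a sketch of the [a-zA-Z0-9] rule, not the actual UrlTokenizer code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenizeSketch {
    // Split on any run of characters outside [a-zA-Z0-9] and drop empties.
    static List<String> tokens(String url) {
        return Arrays.stream(url.split("[^a-zA-Z0-9]+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints: [http, www, google, com, tr, firefox, client, firefox,
        //          a, rls, org, mozilla, en, US, official]
        System.out.println(tokens(
            "http://www.google.com.tr/firefox?client=firefox-a&rls=org.mozilla:en-US:official"));
    }
}
```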



 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
 non-token separators, which is not appropriate in the case of URLs. So I have 
 written a URL tokenizer that emits the tokens matching the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
 url, site and host fields.
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html





[jira] Created: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)
a url tokenizer implementation for tokenizing index fields : url and host 
--

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor


NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
non-token separators, which is not appropriate in the case of URLs. So I have 
written a URL tokenizer that emits the tokens matching the regular expression 
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
which describes the grammar for URIs, URLs can be tokenized with the above 
expression. 


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html





[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Attachment: urlTokenizer.diff

patch for url tokenization

 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
 non-token separators, which is not appropriate in the case of URLs. So I have 
 written a URL tokenizer that emits the tokens matching the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html





[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Description: 
NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
non-token separators, which is not appropriate in the case of URLs. So I have 
written a URL tokenizer that emits the tokens matching the regular expression 
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
which describes the grammar for URIs, URLs can be tokenized with the above 
expression. 

NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, 
site and host fields.


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

  was:
NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
non-token separators, which is not appropriate in the case of URLs. So I have 
written a URL tokenizer that emits the tokens matching the regular expression 
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
which describes the grammar for URIs, URLs can be tokenized with the above 
expression. 


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html


 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating characters such as '_' as 
 non-token separators, which is not appropriate in the case of URLs. So I have 
 written a URL tokenizer that emits the tokens matching the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
 url, site and host fields.
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html





[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-30 Thread Enis Soztutar (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12431548 ] 

Enis Soztutar commented on NUTCH-356:
-

I observed strange behaviour when one of the plug-ins could not be included. 
For example, the plugin system fails to load plugins when there is a circular 
dependency among them, or when the name of a plug-in is misspelled in the 
configuration. 

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: http://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Attachments: NutchTest.java, patch.txt


 While I was trying to solve a problem I reported a while ago (see NUTCH-314), 
 I found out that the problem was actually related to the plugin cache used in 
 the class PluginRepository.java.
 As I said in NUTCH-314, I think I somehow 'force' the way Nutch is meant to 
 work, since I need to frequently submit new URLs and append their contents to 
 the index; I don't (and can't) have a urls.txt file with all the URLs I'm 
 going to fetch, but I recreate it each time a new URL is submitted.
 Thus, I think in the majority of cases you won't have problems using Nutch 
 as-is, since the problem I found occurs only if Nutch is used in a way 
 similar to mine.
 To simplify your test I'm attaching a class that performs something similar 
 to what I need. It fetches and indexes some sample URLs; to avoid webmasters' 
 complaints I left the sample URL list empty, so you should modify the source 
 code and add some URLs.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryException after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it 
 never gets released. It seems that some class maintains a reference to it, and 
 this class is never released since it is cached somewhere in the 
 configuration.
 So I modified PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in the 
 attachment). This way the memory consumption stays stable and I get no 
 OOM anymore.
 Clearly this is not the real solution, since I guess there are performance 
 issues involved, but for the moment it works.
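A middle ground between the leaky cache and no cache at all — keeping one repository per Configuration without pinning it in memory forever — could be sketched with weak keys (a hypothetical illustration only, not the actual Nutch fix; class names are stand-ins):

```java
import java.util.Map;
import java.util.WeakHashMap;

// Stand-in for org.apache.hadoop.conf.Configuration.
class Configuration { }

class PluginRepositorySketch {
    // Weak keys: when a Configuration becomes unreachable, its cache entry
    // (and the repository it maps to) becomes eligible for GC, so repeatedly
    // creating fresh Configurations no longer accumulates repositories.
    private static final Map<Configuration, PluginRepositorySketch> CACHE =
            new WeakHashMap<>();

    static synchronized PluginRepositorySketch get(Configuration conf) {
        return CACHE.computeIfAbsent(conf, c -> new PluginRepositorySketch());
    }
}
```

Whether this preserves the performance benefit of the original cache depends on how long each Configuration lives in the caller's workload.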
