Two problems of lucene.

2006-02-04 Thread xing jiang
Hi,

I have two questions about Lucene.

1. How does Lucene calculate each term's weight in the query? Is it a
simple boolean value?

2. Can I change the similarity measure in Lucene? For instance, I would
like to weight each term in the document by its term frequency alone,
instead of the tf/idf value.
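
A rough sketch of what I hope is possible (I am guessing that
DefaultSimilarity is the right extension point - please correct me if not):

    import org.apache.lucene.search.DefaultSimilarity;

    // Keep the raw term frequency and neutralize the idf component.
    public class TfOnlySimilarity extends DefaultSimilarity {
      public float tf(float freq) {
        return freq;        // raw tf instead of the default sqrt(freq)
      }
      public float idf(int docFreq, int numDocs) {
        return 1.0f;        // ignore document frequency entirely
      }
    }

and then install it with searcher.setSimilarity(new TfOnlySimilarity())
and the same on the IndexWriter?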



--
Regards

Jiang Xing


two problems of using the lucene.

2006-02-04 Thread xing jiang
Hi,

I have two questions about using Lucene and may need your help.

1. For each word, how does Lucene calculate its weight? I only know that
each word in a document is weighted by its tf/idf value.

2. Can I modify Lucene so that I use the term frequency alone, instead of
the tf/idf value, to calculate the similarity between documents and queries?

--
Regards

Jiang Xing


Re: Related searches

2006-02-01 Thread xing jiang
Hi, I have a question about doing related search.

For instance, if I want to say "Support Vector Machine" == "SVM", how can I
use this information when retrieving documents? I don't think a multi-word
mapping like this can be added in the SynonymFilter.
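
The best workaround I can think of is query-time expansion, something like
this (just a sketch; the "contents" field name is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Match either the abbreviation or the multi-word phrase.
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("contents", "svm")), false, false);
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("contents", "support"));
    phrase.add(new Term("contents", "vector"));
    phrase.add(new Term("contents", "machine"));
    q.add(phrase, false, false);

but I would prefer something built into the analysis chain.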


On 2/1/06, Dave Kor [EMAIL PROTECTED] wrote:

 On 1/30/06, Leon Chaddock [EMAIL PROTECTED] wrote:
  Hi,
  Does anyone know if it is possible to show related searches with Lucene?
  For example, if someone searched for "car insurance" you could bring back
  the results and related searches like these

 One possible way is to use the vector space model on the set of
 relevant documents returned by each query.

 For example,
 Relevant documents for the query "car insurance" are docids 1, 2, 4, 9, 10.
 Relevant documents for the query "automobile insurance" are docids 2, 4,
 8, 9, 10.
 Relevant documents for the query "life insurance" are docids 3, 5, 7, 9.

 Here, "automobile insurance" will be scored as more similar to "car
 insurance" than "life insurance" because there is a larger set of
 overlapping docids.

 Lucene can be adapted for this purpose by creating a second index that
 stores all unique queries and their set of relevant docids as Lucene
 Documents. Instead of indexing text terms, we index docids. Finding
 queries similar to the original query, Q, is a simple matter of
 querying this second index with the set of docids relevant to query Q.
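
 A minimal sketch of the second index (Lucene 1.4-style API; the index path
 and field names are invented for illustration):

   import org.apache.lucene.analysis.WhitespaceAnalyzer;
   import org.apache.lucene.document.*;
   import org.apache.lucene.index.*;
   import org.apache.lucene.search.*;

   // Index each unique query as a Document whose "docids" field holds
   // the space-separated ids of its relevant documents.
   IndexWriter writer = new IndexWriter("/tmp/query-index",
       new WhitespaceAnalyzer(), true);
   Document d = new Document();
   d.add(Field.Keyword("query", "car insurance"));
   d.add(Field.Text("docids", "1 2 4 9 10"));
   writer.addDocument(d);
   writer.close();

   // To find queries related to Q, search with Q's relevant docids;
   // the top hits are the most similar stored queries.
   IndexSearcher searcher = new IndexSearcher("/tmp/query-index");
   BooleanQuery related = new BooleanQuery();
   String[] docidsOfQ = {"2", "4", "9", "10"};
   for (int i = 0; i < docidsOfQ.length; i++) {
     related.add(new TermQuery(new Term("docids", docidsOfQ[i])),
         false, false);   // optional clause: more overlap = higher score
   }
   Hits hits = searcher.search(related);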

 Hope this helps.


 --
 Dave Kor, Research Assistant
 Center for Information Mining and Extraction
 School of Computing
 National University of Singapore.





--
Regards

Jiang Xing


Re: Related searches

2006-01-31 Thread xing jiang
I think you should first build a domain-specific dictionary that says, for
instance, automobile = car. This approach can satisfy your requirement.

On 1/30/06, Leon Chaddock [EMAIL PROTECTED] wrote:

 Hi,
 Does anyone know if it is possible to show related searches with Lucene?
 For example, if someone searched for "car insurance" you could bring back
 the results and related searches like these:


 Automobile Insurance
 Car Insurance Quote
 Car Insurance Quotes
 Auto Insurance
 Cheap Car Insurance
 Car Insurance Company
 Car Insurance Companies
 Health Insurance
 Car Insurance Rates
 Car Insurance Rate
 Car Insurance Rental
 Insurance Quote
 Online Car Insurance Quote
 Home Insurance

 Thanks

 Leon




--
Regards

Jiang Xing


Re: How does the lucene normalize the score?

2006-01-27 Thread xing jiang
Hi,

Thank you for your help.


On 1/27/06, Chris Lamprecht [EMAIL PROTECTED] wrote:

 It takes the highest scoring document, if greater than 1.0, and
 divides every hit's score by this number, leaving them all <= 1.0.
 Actually, I just looked at the code, and it actually does this by
 taking 1/maxScore and then multiplying this by each score (equivalent
 results in the end, maybe more efficient(?)).  See the method
 getMoreDocs() in Hits.java (org.apache.lucene.search.Hits):

 [...]
    float scoreNorm = 1.0f;

    if (length > 0 && topDocs.getMaxScore() > 1.0f) {
      scoreNorm = 1.0f / topDocs.getMaxScore();
    }

    int end = scoreDocs.length < length ? scoreDocs.length : length;
    for (int i = hitDocs.size(); i < end; i++) {
      hitDocs.addElement(new HitDoc(scoreDocs[i].score * scoreNorm,
                                    scoreDocs[i].doc));
    }



 On 1/27/06, xing jiang [EMAIL PROTECTED] wrote:
  Hi,
 
  I want to know how Lucene normalizes the score. I see the Hits class has
  a function to get each document's score, but I don't know how Lucene
  calculates the normalized score; Lucene in Action only says "normalized
  score of the nth top-scoring document".
  --
  Regards
 
  Jiang Xing
 
 





--
Regards

Jiang Xing


How does the lucene normalize the score?

2006-01-26 Thread xing jiang
Hi,

I want to know how Lucene normalizes the score. I see the Hits class has a
function to get each document's score, but I don't know how Lucene
calculates the normalized score; Lucene in Action only says "normalized
score of the nth top-scoring document".
--
Regards

Jiang Xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-19 Thread xing jiang
Hi Mathias,

Can you give more details?  Is your application for text + ontology, or
ontology only?

regards

jiang xing

On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote:

 Hi!

 (1) I'm working on a similar problem, but based on MPEG-7 Semantic
 Description Graphs. I already have a prototype for path-based matching
 within Lucene, integrated in my sf project Caliph & Emir
 (http://caliph-emir.sf.net). I've already adapted the approach to an
 ontology which had to be searched.

 My approach works roughly like this (toy sketch after the list):
 * index all paths up to a certain length in a graph as strings in
 Lucene
 * index all node descriptions in another index
 * within the query graph, nodes are Lucene queries - query expansion to
 node ids based on the node index
 * search for all expanded query graphs and merge results.
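
 To make the first two steps concrete, a toy sketch (path encodings, ids
 and field names are invented; real ones are application-specific):

   import org.apache.lucene.document.*;

   // One Document per graph: every bounded-length path becomes one term
   // in the "paths" field.
   Document graphDoc = new Document();
   graphDoc.add(Field.Keyword("graph", "graph-42"));
   graphDoc.add(Field.Text("paths",
       "n1/authorOf/n7 n7/about/n9 n1/authorOf/n7/about/n9"));

   // Separate node index, used to expand query-graph nodes to node ids.
   Document nodeDoc = new Document();
   nodeDoc.add(Field.Keyword("node", "n7"));
   nodeDoc.add(Field.Text("description", "requirements document"));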

 Unfortunately I didn't have time yet to do a full evaluation, but
 preliminary results are promising.

 The evaluation and a more comprehensive description of the approach can
 be found in the proceedings of TIR 05 (Text Information Retrieval
 Workshop 2005 in Koblenz, Germany):
 http://www-ai.upb.de/aisearch/tir-05/proceedings/lux05-fast-and-simple-path-index-based-retrieval.pdf

 The prototype is available @ http://caliph-emir.sf.net.

 I'm open to comments and ideas on the approach as it is part of my PhD
 and I'm working on a method without query expansion :-)

 (2) A second thing is the feature-based retrieval of nodes within an
 ontology, which allows really fast indexing and retrieval as no
 path-walking takes place.
 Works like this:
 * nodes being the documents / entities being searched for in the
 ontology are extracted
 * surrounding nodes / literals are used as characteristic features
 * with some heuristics and some runtime configuration classifications,
 text & keyword fields are separated
 * retrieval is purely based on text and keywords, the same with
 similarity search
 * additional clustering is done on snippets from search results.
 I already have a prototype running with this approach, but no evaluation
 yet, sorry! For more information on this one please contact me. A
 publication on this is currently in review, so I cannot give a link here
 ;(

 References:
 - Rodriguez, M.A. & Egenhofer, M.J. (2003), 'Determining Semantic
 Similarity among Entity Classes from Different Ontologies', IEEE
 Transactions on Knowledge and Data Engineering 15(2), 442--456.
 - Varelas, G.; Voutsakis, E.; Raftopoulou, P.; Petrakis, E.G. & Milios,
 E.E. (2005), 'Semantic similarity methods in WordNet and their
 application to information retrieval on the web', in 'WIDM '05:
 Proceedings of the 7th annual ACM international workshop on Web
 information and data management', ACM Press, New York, NY, USA, pp.
 10--16.


 regards,
 Mathias

 
 DI Mathias Lux
 Know-Center & Graz University of Technology, Austria
 Institute for Knowledge Management (IWM)
 Inffeldgasse 21a, 8010 Graz, Austria
 Email  : [EMAIL PROTECTED]
 Tel: +43 316 873 9274  Fax: +43 316 873 9252





--
Regards

Jiang Xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-19 Thread xing jiang
Hi,

I am not sure whether my understanding is correct.

In your application, a concept like "document" first should be defined as a
class in the ontology? Then each document is an instance of this class. It
uses its contents as its features, and the related concepts are also added
into the feature vector.

I think besides how to select the features, another problem is how to define
the similarity measure. Given a submitted query, how do you define the
similarity between the query and the results, when each document is
characterized by both its keywords and its ontological annotations?

Yours truly,
Jiang Xing




On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote:

 It's for both, ontology + contents (Word, PDF, PPT - always the same
 candidates). The main disadvantage of this approach is that the main
 nodes in the ontology have to be defined.

 Imagine the following use case:
 An ontology describes a company's content and knowledge management system.
 Persons, hierarchies, projects, documents and concepts (project manager,
 author, requirements document and so on) are within this ontology.
 If you choose, for instance, to index all documents, projects and persons,
 you identify all nodes that symbolize those documents, projects and persons.
 For persons you can then index the name (given name, surname, title and so
 on), department, age, ... For documents you can take the contents (from the
 document file itself), all metadata, the author's name, the category and
 type and so on. For projects you can extract from the ontology e.g. the
 description, name, participants, ... In the Lucene index there are then
 only 3 types of nodes indexed.

 The main point in this is: how do you define the features of a node (in
 this use case document, person and project), and which neighbouring
 literals describe the concept of the ontology best?

 The selection of concepts / classes / node types (whatever :) depends on
 the use case.

 hope this helps a bit,
 mathias

  -----Original Message-----
  From: xing jiang [mailto:[EMAIL PROTECTED]
  Sent: Thursday, 19 January 2006 12:14
  To: java-user@lucene.apache.org
  Subject: Re: Use the lucene for searching in the Semantic Web.
 
  Hi Mathias,
 
  Can you give more details?  Is your application for text + ontology, or
  ontology only?
 
  regards
 
  jiang xing
 
   [...]

Re: Use the lucene for searching in the Semantic Web.

2006-01-19 Thread xing jiang
On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote:



  -----Original Message-----
  From: xing jiang [mailto:[EMAIL PROTECTED]
  Sent: Thursday, 19 January 2006 13:11
  To: java-user@lucene.apache.org
  Subject: Re: Use the lucene for searching in the Semantic Web.
 
  Hi,
 
  I am not sure whether my understanding is correct.
 
  In your application, a concept like "document" first should be defined
  as a class in the ontology? Then each document is an instance of this
  class. It uses its contents as its features, and the related concepts
  are also added into the feature vector.

 Yes, that's it in general. You decide which classes are the ones to index
 and select all instances from this class or its subclasses.

  I think besides how to select the features, another problem is how to
  define the similarity measure. Given a submitted query, how do you
  define the similarity between the query and the results, when each
  document is characterized by both its keywords and its ontological
  annotations?
 The similarity measure is term-based, tf*idf weighted in its simple form.
 A further enhancement would be a weighting of nodes, e.g. based on
 information content (see e.g. Rodriguez, M.A. & Egenhofer, M.J. (2003)),
 where a test corpus helps to weight the importance of nodes based on their
 labels. But this is just a direction, not tested yet.


Actually, my problem is that for a document d, its feature vector may
contain both keywords and concepts, and I don't know how to weight the two
kinds of items. Right now I use a naive method: given a document d, I can
obtain a rank D based on the keyword method. The document is also annotated
with a concept c (the simplest example), and people can have a rank C for
these concepts in the domain ontology, where the most relevant concepts are
at the top of the concept list. Finally, the document's rank is decided by
the sum (C + D).


 I'm afraid that introducing path-based similarity using Lucene is, in my
 opinion, impossible :) What someone - if not me - could try is to use the
 structural context of a node instead of the textual context based on paths,
 as I've done with MPEG-7. This should be quite easy, as RDF shares most
 characteristics with MPEG-7 semantic graphs, having e.g. unique node labels
 (URIs by definition in RDF), a limited set of possible relations (limited
 by the number of nodes in RDF, but that should do also) and so on.

 - mathias





--
Regards

Jiang Xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-19 Thread xing jiang
On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote:


   Actually, my problem is that for a document d, its feature vector may
   contain both keywords and concepts, and I don't know how to weight the
   two kinds of items. Right now I use a naive method: given a document d,
   I can obtain a rank D based on the keyword method. The document is also
   annotated with a concept c (the simplest example), and people can have
   a rank C for these concepts in the domain ontology, where the most
   relevant concepts are at the top of the concept list. Finally, the
   document's rank is decided by the sum (C + D).

 Hmm, if you index the concepts e.g. by their URI in a Lucene field,
 you can set a boost value at indexing time like this:

 Field conceptField = Field.Text("classification",
     "http://concepts.server.com/classification/car/mercedes");
 conceptField.setBoost(1.3f);

 So your concept for this document, where the field is added, is boosted
 in the relevance computation.

 If you know the concept boost value at search time you can add the boost
 value to the query, e.g. querying for

 classification:"http://concepts.server.com/classification/car/mercedes"^4

 Of course you have to think about the whole thing, but I think with good
 boost values it would work.

 - mathias

 ps. Instead of C+D I would use (1-l)*C + l*D, so that l from [0,1] can be
 used to specify whether concept or content has more influence.


I will compute each concept's relevance to each query, so I cannot set the
boost value at indexing time. Actually, I already use the (1-l)*C + l*D
method in my prototype, but my supervisor said this method is odd, as it is
too simple.
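
For reference, the whole combination is just (variable names and example
values invented; l is tuned by hand):

    float l = 0.5f;              // l in [0,1]: higher l = more keyword weight
    float conceptRankC = 2.0f;   // made-up example ranks
    float keywordRankD = 7.5f;
    float rank = (1 - l) * conceptRankC + l * keywordRankD;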






--
Regards

Jiang Xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-19 Thread xing jiang
On 1/20/06, Klaus [EMAIL PROTECTED] wrote:

 Hi,

 Actually, my problem is that for a document d, its feature vector may
 contain both keywords and concepts.

 What exactly do you mean by feature vector? You are referring to the
 predicate-object pairs connected to one subject node, aren't you?


The feature vector may be bigger than the predicate-object pairs. In my
application, each document may be annotated with several concepts saying
that this document contains an instance of a class. Thus, each document will
be indexed like (keywords in the document, concepts in the document) -- URI.
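
In Lucene terms, something like this (a sketch only; field names and URIs
are made up):

    import org.apache.lucene.document.*;

    // One Lucene Document per annotated document: the full text plus one
    // keyword field per annotating concept, keyed by the document URI.
    Document d = new Document();
    d.add(Field.Keyword("uri", "http://example.org/doc-1"));
    d.add(Field.Text("contents", "... full text of the document ..."));
    d.add(Field.Keyword("concept",
        "http://example.org/onto#SupportVectorMachine"));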


 I don't know how to weight the two kinds of items. Right now I use a
 naive method: given a document d, I can obtain a rank D based on the
 keyword method. The document is also annotated with a concept c (the
 simplest example), and people can have a rank C for these concepts in the
 domain ontology, where the most relevant concepts are at the top of the
 concept list. Finally, the document's rank is decided by the sum (C + D).

 I'm going to implement something like a PageRank algorithm for my search
 engine. In contrast to the Google approach, I cannot just count the edges
 of one node; because of the known semantics I can weight them. Of course
 this implies knowledge of the domain ontology. For instance, if there is a
 predicate cited_in_document, I could rank a document higher if it is often
 cited. But I'm not sure about the results...
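
 A toy sketch of the weighted variant (plain Java, nothing Lucene-specific;
 the edge weights and damping factor are assumptions):

   // w[i][j] is the weight of the edge from node i to node j (0 = none).
   static float[] pagerank(float[][] w, int iterations) {
     int n = w.length;
     float d = 0.85f;                       // the usual damping factor
     float[] pr = new float[n];
     java.util.Arrays.fill(pr, 1f / n);
     float[] out = new float[n];            // total outgoing weight per node
     for (int i = 0; i < n; i++)
       for (int j = 0; j < n; j++) out[i] += w[i][j];
     for (int it = 0; it < iterations; it++) {
       float[] next = new float[n];
       for (int j = 0; j < n; j++) {
         float sum = 0f;
         for (int i = 0; i < n; i++)
           if (out[i] > 0) sum += pr[i] * w[i][j] / out[i];
         next[j] = (1 - d) / n + d * sum;   // weighted PageRank update
       }
       pr = next;
     }
     return pr;
   }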


I am very interested in your approach. You can see a PageRank-like method
used in Swoogle, but the relations they use are only simple ones, such as
imports (used in OWL files). If we can use semantic-level relations, it
should be better. But I am not sure it can succeed, as it requires deciding
how to weight the relations.

Klaus






--
Regards

Jiang Xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-18 Thread xing jiang
Hi,

I have done a small survey of information retrieval on the Semantic Web
(maybe I have missed many papers; most papers I found were published in
recent WWW and CIKM conferences :).

1. A typical way of using the ontology is to select exact terms from the
domain ontology to form queries. The first one may be OntoSeek
(www.loa-cnr.it/Papers/OntoSeek.pdf). Similar work is Latifur Khan's
(Retrieval effectiveness of an ontology-based model for information
selection, VLDB 2004).

2. Guha et al. (Semantic Search, WWW2003) used a domain ontology to form a
concept graph; users then only need to browse the generated concept graph.
Similar work is Eero Hyvonen's MuseumFinland. They all used the semantic
structure of the domain ontology to help users browse.

3. QuizRDF (www.cs.rutgers.edu/~shklar/www11/final_submissions/paper6.pdf)
used yet another kind of method for using the domain ontology. Klaus, I
think your method should be better than QuizRDF's.

One interesting method I found is Rocha's work (A hybrid approach for
searching in the Semantic Web, WWW2004). They still used a keyword-based
method for retrieving documents on the Semantic Web, but I cannot find any
more information about their work; the application I am building can be seen
as an extension of it.

Actually, Swoogle focuses on the ontology-level files only; it crawls RDF,
OWL & DAML files. But they do not provide any new method that combines the
traditional keyword approach for searching text files. Li Ding used a
variant of PageRank for ontology files, but I am not sure that method can be
combined with keyword-based retrieval.

Maybe I have missed too many things in this survey. However, I think we may
find some good new methods of using domain ontologies on the Semantic Web.

Yours truly,
Jiang Xing




On 1/19/06, Klaus [EMAIL PROTECTED] wrote:

 Hello,


 Hi,

 I think one problem with the existing methods is that, to query RDF files
 or similar structures, we have to form SQL-like queries, whereas for
 searching text files we only need to type several keywords. Can we combine
 the two methods, and how? For instance, I would only need to enter some
 keywords.

 Yes, you are right. At the moment I offer the users a UI where they can
 input some keywords and, in addition, an RQL-like query via drop-down
 menus. With the help of this semantic query they can demarcate the result
 set, e.g. saying that all results should belong to one class, or deal with
 one theme.

 Now I am trying to automate the generation of the query... but I'm not sure
 how to do this exactly. Maybe I will use some kind of pseudo-relevance
 feedback to run a semantic analysis on the first result set.


 Why should we have to learn an SQL-like language to search
 the Semantic Web?

 Maybe this paper can help you... Primarily, the Semantic Web is for agents
 and so on, not for humans. So the information has to have a structure
 which can be exploited.

 http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21

 By the way, maybe you should take a look at http://swoogle.umbc.edu/.
 There is also quite a big number of papers on scholar.google.

 Do you have any ideas right now?

 Peace

 Klaus






--
Regards

Jiang Xing


Re: Use the lucene for searching in the Semantic Web.

2006-01-17 Thread xing jiang
Hi,

I think one problem with the existing methods is that, to query RDF files or
similar structures, we have to form SQL-like queries, whereas for searching
text files we only need to type several keywords. Can we combine the two
methods, and how? For instance, I would only need to enter some keywords,
and the system handles the rest of the process. Why should we have to learn
an SQL-like language to search the Semantic Web?

regards
Jiang Xing


On 1/18/06, Klaus [EMAIL PROTECTED] wrote:

 Hi Jiang,

 I'm currently facing a similar problem. Up to now I have to use a
 graph-matching algorithm for the semantic query, but the full-text search
 in the Semantic Web is performed by Lucene.
 At first I wrote the whole text into one index. The document contains one
 field for the unique id and one for the whole text. For the semantic markup
 I use an extra index: every RDF triple results in a document with the
 fields id, predicate, subject and object. Every query is executed on both
 indexes. I use an extra index for the RDF data because this results in a
 higher score for the documents. You might argue that this would distort the
 results, but from my point of view explicit metadata should be scored
 higher than terms in the document body.
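
 In code, the triple index looks roughly like this (a sketch; analyzer
 choice, index path and example values are placeholders):

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.*;
   import org.apache.lucene.index.IndexWriter;

   // One Lucene Document per RDF triple.
   IndexWriter triples = new IndexWriter("/tmp/rdf-index",
       new StandardAnalyzer(), true);
   Document t = new Document();
   t.add(Field.Keyword("id", "doc-17"));
   t.add(Field.Keyword("subject", "http://example.org/doc-17"));
   t.add(Field.Keyword("predicate",
       "http://purl.org/dc/elements/1.1/creator"));
   t.add(Field.Text("object", "Jiang Xing"));
   triples.addDocument(t);
   triples.close();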

 Cheers,

 Klaus

 -----Original Message-----
 From: jason [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, 17 January 2006 15:35
 To: java-user@lucene.apache.org
 Subject: Use the lucene for searching in the Semantic Web.

 Hi friends,

 What do you think about using Lucene for searching the Semantic Web? I am
 trying to use Lucene to search documents with ontological annotations, but
 I have not yet found a good model to combine the keyword information and
 the ontological information.

 regards
 jiang xing






--
Regards

Jiang Xing