Two problems with Lucene.
Hi, I have two questions about Lucene. 1. How does Lucene calculate each term's weight in the query? Is it a simple boolean value? 2. Can I change the similarity measure in Lucene? For instance, could I use only the term frequency, instead of the tf-idf value, to weight each term in the document? -- Regards Jiang Xing
Two problems with using Lucene.
Hi, I have two questions about using Lucene and may need your help. 1. For each word, how does Lucene calculate its weight? I only know that each word in a document is weighted by its tf-idf value. 2. Can I modify Lucene so that it uses the term frequency alone, instead of the tf-idf value, to calculate the similarity between documents and queries? -- Regards Jiang Xing
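For question 2, Lucene does let you plug in your own Similarity (Searcher and IndexWriter both accept one), and overriding its idf() factor to return a constant is the usual way to get tf-only scoring. The toy sketch below is plain Java, not Lucene's API; the class and method names are made up purely to show numerically what dropping idf changes:

```java
// Illustrative sketch (not Lucene code): plain term frequency vs. tf-idf.
// In Lucene itself one would subclass Similarity and override idf()
// to return 1.0f to get the tf-only behaviour.
public class TermWeight {
    // Raw term-frequency weighting: weight grows with occurrences only.
    public static double tfOnly(int termFreqInDoc) {
        return termFreqInDoc;
    }

    // Classic tf-idf: terms frequent in the document but rare in the
    // collection get the highest weight.
    public static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
        double idf = Math.log((double) numDocs / (1 + docFreq)) + 1.0;
        return termFreqInDoc * idf;
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a doc but present in only 5 of 1000
        // docs outweighs one occurring 3 times but present in 900 docs.
        System.out.println(tfIdf(3, 5, 1000) > tfIdf(3, 900, 1000)); // true
    }
}
```

With tfOnly the two terms above would score identically; that is exactly the behaviour the question asks for.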
Re: Related searches
Hi, I have a question about doing related searches. For instance, suppose I want to say Support Vector Machine == SVM. How can I use this information when retrieving documents? I don't think it can be added in the SynonymFilter. On 2/1/06, Dave Kor [EMAIL PROTECTED] wrote: On 1/30/06, Leon Chaddock [EMAIL PROTECTED] wrote: Hi, Does anyone know if it is possible to show related searches with Lucene? For example, if someone searched for car insurance you could bring back the results and related searches like these. One possible way is to use the vector space model on the set of relevant documents returned by each query. For example: Relevant documents for the query car insurance are docids 1, 2, 4, 9, 10. Relevant documents for the query automobile insurance are docids 2, 4, 8, 9, 10. Relevant documents for the query life insurance are docids 3, 5, 7, 9. Here, automobile insurance will be scored as more similar to car insurance than life insurance, because there is a larger set of overlapping docids. Lucene can be adapted for this purpose by creating a second index that stores all unique queries and their sets of relevant docids as Lucene Documents. Instead of indexing text terms, we index docids. Finding queries similar to the original query, Q, is then a simple matter of querying this second index with the set of docids relevant to query Q. Hope this helps. -- Dave Kor, Research Assistant, Center for Information Mining and Extraction, School of Computing, National University of Singapore. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards Jiang Xing
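Dave's docid-overlap idea can be sketched in a few lines of plain Java, using his own example numbers; set intersection stands in here for Lucene's scoring of the second index:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the suggestion above: treat each query's relevant docids as
// the "terms" of a pseudo-document and compare queries by docid overlap.
public class RelatedQueries {
    public static Set<Integer> docids(Integer... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    // Overlap score: size of the intersection of the two docid sets.
    public static int overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        return inter.size();
    }

    public static void main(String[] args) {
        Set<Integer> car  = docids(1, 2, 4, 9, 10); // "car insurance"
        Set<Integer> auto = docids(2, 4, 8, 9, 10); // "automobile insurance"
        Set<Integer> life = docids(3, 5, 7, 9);     // "life insurance"
        // automobile insurance shares 4 docids with car insurance,
        // life insurance only 1, so it ranks as the more related query.
        System.out.println(overlap(car, auto)); // 4
        System.out.println(overlap(car, life)); // 1
    }
}
```

In the real setup the second Lucene index would do this intersection implicitly: each docid is a term, and a query built from Q's docids scores other queries by how many docid terms they share.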
Re: Related searches
I think you should first build a kind of domain-specific dictionary, which would state, for instance, automobile = car. This approach can satisfy your requirement. On 1/30/06, Leon Chaddock [EMAIL PROTECTED] wrote: Hi, Does anyone know if it is possible to show related searches with Lucene? For example, if someone searched for car insurance you could bring back the results and related searches like these: Automobile Insurance Car Insurance Quote Car Insurance Quotes Auto Insurance Cheap Car Insurance Car Insurance Company Car Insurance Companies Health Insurance Car Insurance Rates Car Insurance Rate Car Insurance Rental Insurance Quote Online Car Insurance Quote Home Insurance Thanks Leon -- Regards Jiang Xing
Re: How does Lucene normalize the score?
Hi, thank you for your help. On 1/27/06, Chris Lamprecht [EMAIL PROTECTED] wrote: It takes the highest-scoring document, if greater than 1.0, and divides every hit's score by this number, leaving them all <= 1.0. Actually, I just looked at the code, and it does this by taking 1/maxScore and then multiplying this by each score (equivalent results in the end, maybe more efficient?). See the method getMoreDocs() in Hits.java (org.apache.lucene.search.Hits): [...]
float scoreNorm = 1.0f;
if (length > 0 && topDocs.getMaxScore() > 1.0f) {
  scoreNorm = 1.0f / topDocs.getMaxScore();
}
int end = scoreDocs.length < length ? scoreDocs.length : length;
for (int i = hitDocs.size(); i < end; i++) {
  hitDocs.addElement(new HitDoc(scoreDocs[i].score * scoreNorm, scoreDocs[i].doc));
}
On 1/27/06, xing jiang [EMAIL PROTECTED] wrote: Hi, I want to know how Lucene normalizes the score. I see the Hits class has a function to get each document's score, but I don't know how Lucene calculates the normalized score; in Lucene in Action it only says normalized score of the nth top-scoring documents. -- Regards Jiang Xing
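The quoted logic can be restated as a self-contained routine (the class and method names here are mine, not Hits.java internals): if the top score exceeds 1.0, every score is multiplied by 1/maxScore, so the best hit becomes exactly 1.0 and all others land below it; otherwise scores pass through unchanged.

```java
import java.util.Arrays;

// Self-contained restatement of the normalization step quoted above.
public class ScoreNorm {
    public static float[] normalize(float[] scores, float maxScore) {
        float scoreNorm = 1.0f;
        // Only normalize when there are hits and the top score is > 1.0.
        if (scores.length > 0 && maxScore > 1.0f) {
            scoreNorm = 1.0f / maxScore;
        }
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * scoreNorm;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] norm = normalize(new float[] {2.0f, 1.0f, 0.5f}, 2.0f);
        System.out.println(Arrays.toString(norm)); // [1.0, 0.5, 0.25]
    }
}
```

Note the asymmetry: a result set whose top score is already below 1.0 is left alone, so normalized scores are not comparable across different queries.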
How does Lucene normalize the score?
Hi, I want to know how Lucene normalizes the score. I see the Hits class has a function to get each document's score, but I don't know how Lucene calculates the normalized score; in Lucene in Action it only says normalized score of the nth top-scoring documents. -- Regards Jiang Xing
Re: Using Lucene for searching in the Semantic Web.
Hi Mathias, Can you give more details? Is your application for text + ontology, or ontology only? regards jiang xing On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote: Hi! (1) I'm working on a similar problem, but based on MPEG-7 Semantic Description Graphs. I already have a prototype for path-based matching within Lucene, integrated in my sf project Caliph & Emir (http://caliph-emir.sf.net). I've already adapted the approach to an ontology which had to be searched. My approach works roughly like this: * index all paths up to a certain length in a graph as strings in Lucene * index all node descriptions in another index * within the query graph, nodes are Lucene queries - query expansion to node ids based on the node index * search for all expanded query graphs and merge results. Unfortunately I haven't had time yet to do a full evaluation, but preliminary results are promising. The evaluation and a more comprehensive description of the approach can be found in the proceedings of TIR 05 (Text Information Retrieval Workshop 2005 in Koblenz, Germany): http://www-ai.upb.de/aisearch/tir-05/proceedings/lux05-fast-and-simple-path-index-based-retrieval.pdf The prototype is available @ http://caliph-emir.sf.net. I'm open to comments and ideas on the approach, as it is part of my PhD and I'm working on a method without query expansion :-) (2) A second thing is the feature-based retrieval of nodes within an ontology, which allows really fast indexing and retrieval as no path-walking takes place. Works like this: * nodes being the documents / entities being searched for in the ontology are extracted * surrounding nodes / literals are used as characteristic features * with some heuristics and some runtime configuration, classification, text and keyword fields are separated * retrieval is purely based on text and keywords, the same with similarity search * additional clustering is done on snippets from search results.
I already have a prototype running with this approach, but no evaluation yet, sorry! For more information on this one please contact me. A publication on this is currently in review, so I cannot give a link here ;( References: - Rodriguez, M.A. & Egenhofer, M.J. (2003), 'Determining Semantic Similarity among Entity Classes from Different Ontologies', IEEE Transactions on Knowledge and Data Engineering 15(2), 442--456. - Varelas, G.; Voutsakis, E.; Raftopoulou, P.; Petrakis, E.G. & Milios, E.E. (2005), 'Semantic similarity methods in WordNet and their application to information retrieval on the web', in WIDM '05: Proceedings of the 7th annual ACM international workshop on Web information and data management, ACM Press, New York, NY, USA, pp. 10--16. regards, Mathias DI Mathias Lux Know-Center Graz University of Technology, Austria Institute for Knowledge Management (IWM) Inffeldgasse 21a, 8010 Graz, Austria Email: [EMAIL PROTECTED] Tel: +43 316 873 9274 Fax: +43 316 873 9252 -- Regards Jiang Xing
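The first step of Mathias's approach (index all paths up to a certain length as strings) can be sketched as a small path enumerator; the graph encoding and the slash-separated token format below are my own illustrative assumptions, not what Caliph & Emir actually does:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Enumerate every path up to maxLen nodes in a small directed graph and
// render each as one string token that could then be indexed by Lucene.
public class PathIndex {
    public static List<String> paths(Map<String, List<String>> graph, int maxLen) {
        List<String> out = new ArrayList<>();
        for (String start : graph.keySet()) {
            walk(graph, start, start, 1, maxLen, out);
        }
        Collections.sort(out); // deterministic order for readability
        return out;
    }

    private static void walk(Map<String, List<String>> graph, String node,
                             String path, int len, int maxLen, List<String> out) {
        out.add(path);
        if (len == maxLen) return;
        for (String next : graph.getOrDefault(node, List.of())) {
            walk(graph, next, path + "/" + next, len + 1, maxLen, out);
        }
    }

    public static void main(String[] args) {
        // Tiny made-up semantic graph: agent -> event -> place.
        Map<String, List<String>> g = new HashMap<>();
        g.put("agent", List.of("event"));
        g.put("event", List.of("place"));
        g.put("place", List.of());
        System.out.println(paths(g, 3));
        // [agent, agent/event, agent/event/place, event, event/place, place]
    }
}
```

Once every path is a token, matching a query graph reduces to ordinary term matching on the path index, which is what makes the approach fast.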
Re: Using Lucene for searching in the Semantic Web.
Hi, I am not sure whether my understanding is correct. In your application, should a concept document first be defined as a class in the ontology? Then each document is an instance of this class. It uses its contents as its features, and the related concepts are also added to the feature vector. I think besides how to select the features, another problem is how to define the similarity measure. Given a submitted query, how do you define the similarity between the query and the result? One document is characterized by its keywords and the ontological annotations. Yours truly, Jiang Xing On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote: It's for both, ontology + contents (Word, PDF, PPT, the usual candidates). The main disadvantage of this approach is that the main nodes in the ontology have to be defined. Imagine the following use case: an ontology describes a company's content and knowledge management system. Persons, hierarchies, projects, documents and concepts (project manager, author, requirements document and so on) are within this ontology. If you, for instance, choose to index all documents, projects and persons, you identify all nodes that symbolize the documents, projects and persons. For persons you can then index name (given name, surname, title and so on), department, age, ... For documents you can take the contents (from the document file itself), all metadata, the author's name, the category and type and so on. For projects you can extract from the ontology e.g. the description, name, participants, ... In the Lucene index there are only 3 types of nodes indexed. The main point in this is: how do you define the features of a node (in this use case document, person and project), and which neighbouring literals describe the concept of the ontology best? The selection of concepts / classes / node types (whatever :) depends on the use case. hope this helps a bit, mathias -Original message- From: xing jiang [mailto:[EMAIL PROTECTED]] Sent: Thursday, 19 January 2006 12:14 To: java-user@lucene.apache.org Subject: Re: Using Lucene for searching in the Semantic Web. [quoted text from the earlier message trimmed]
Re: Using Lucene for searching in the Semantic Web.
On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote: -Original message- From: xing jiang [mailto:[EMAIL PROTECTED]] Sent: Thursday, 19 January 2006 13:11 To: java-user@lucene.apache.org Subject: Re: Using Lucene for searching in the Semantic Web. Hi, I am not sure whether my understanding is correct. In your application, should a concept document first be defined as a class in the ontology? Then each document is an instance of this class. It uses its contents as its features, and the related concepts are also added to the feature vector. Yes, that's it in general. You decide which classes are the ones to index and select all instances of this class or its subclasses. I think besides how to select the features, another problem is how to define the similarity measure. Given a submitted query, how do you define the similarity between the query and the result? One document is characterized by its keywords and the ontological annotations. The similarity measure is term-based, tf*idf-weighted in its simple form. A further enhancement would be a weighting of nodes, e.g. based on information content (see e.g. Rodriguez, M.A. & Egenhofer, M.J. (2003)), where a test corpus helps to weight the importance of nodes based on their labels. But this is just a direction, not tested yet. Actually, my problem is that, for instance, for a document d, its feature vector may contain keywords and concepts. I don't know how to weight the two kinds of items. Right now I use a naive method: given a document d, I can obtain a rank D based on the keyword method. Also, it is annotated with a concept c (the simplest example). People can have a rank C of these concepts in the domain ontology, where the most relevant concepts are at the top of this concept list. Finally, the document's rank is decided by the sum (C + D).
I'm afraid that introducing path-based similarity using Lucene is, in my opinion, impossible :) What someone (if not me) could try is to use the structural context of a node, instead of the textual context based on paths, as I've done with MPEG-7. This should be quite easy, as RDF shares most characteristics with MPEG-7 semantic graphs, having e.g. unique node labels (URIs by definition in RDF), a limited set of possible relations (limited by the number of nodes in RDF, but that should do also) and so on. - mathias -- Regards Jiang Xing
Re: Using Lucene for searching in the Semantic Web.
On 1/19/06, Mathias Lux [EMAIL PROTECTED] wrote: Actually, my problem is that, for instance, for a document d, its feature vector may contain keywords and concepts. I don't know how to weight the two kinds of items. Right now I use a naive method: given a document d, I can obtain a rank D based on the keyword method. Also, it is annotated with a concept c (the simplest example). People can have a rank C of these concepts in the domain ontology, where the most relevant concepts are at the top of this concept list. Finally, the document's rank is decided by the sum (C + D). Hmm, if you index the concepts, e.g. based on their URI, in a Lucene Field, you can set a boost value at indexing time like this:
Field conceptField = Field.Text("classification", "http://concepts.server.com/classification/car/mercedes");
conceptField.setBoost(1.3f);
So your concept for this document, where the field is added, is boosted in relevance computation. If you know the concept boost value at search time, you can add the boost value to the query, e.g. querying for classification:http://concepts.server.com/classification/car/mercedes^4 Of course you have to think the whole thing through, but I think with good boost values it would work. - mathias ps. Instead of C+D I would use (1-l)*C + l*D, so that l from [0,1] can be used to specify whether concept or content has more influence. I will compute each concept's relevance to each query; thus, I cannot set the boost value at indexing time. Actually, I already use the (1-l)*C + l*D method in my prototype, but my supervisor said this method is funny as it is too simple. -- Regards Jiang Xing
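Written out, the combination rule from the postscript is a plain linear interpolation between a concept-based score C and a keyword-based score D, with lambda in [0,1] controlling which side dominates (simple, but it is a standard way to mix two evidence sources); the class name and example scores below are illustrative:

```java
// Linear interpolation of a concept score and a keyword score:
// score = (1 - lambda) * C + lambda * D, lambda in [0, 1].
public class CombinedScore {
    public static double combine(double conceptScore, double keywordScore, double lambda) {
        if (lambda < 0.0 || lambda > 1.0) {
            throw new IllegalArgumentException("lambda must be in [0,1]");
        }
        return (1.0 - lambda) * conceptScore + lambda * keywordScore;
    }

    public static void main(String[] args) {
        // lambda = 1 ignores concepts entirely; lambda = 0 ignores keywords;
        // lambda = 0.5 weighs both sides equally.
        System.out.println(combine(0.8, 0.4, 0.5));
    }
}
```

One practical refinement over a fixed lambda is to tune it on a held-out set of judged queries, which also gives an answer to the supervisor's objection that the choice is arbitrary.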
Re: Using Lucene for searching in the Semantic Web.
On 1/20/06, Klaus [EMAIL PROTECTED] wrote: Hi, Actually, my problem is that, for instance, for a document d, its feature vector may contain keywords and concepts. What exactly do you mean by feature vector? You are referring to the predicate - object pairs connected to one subject node, aren't you? The feature vector may be bigger than the object-predicate pairs. In my application, each document may be annotated with several concepts to say that this document contains an instance of a class. Thus, each document will be indexed like (keywords in the document, concepts in the document) -- URI. I don't know how to weight the two kinds of items. Right now I use a naive method: given a document d, I can obtain a rank D based on the keyword method. Also, it is annotated with a concept c (the simplest example). People can have a rank C of these concepts in the domain ontology, where the most relevant concepts are at the top of this concept list. Finally, the document's rank is decided by the sum (C + D). I'm going to implement something like a PageRank algorithm for my search engine. In contrast to the Google approach I cannot just count the edges of one node; because of the known semantics I can weight them. Of course this implies knowledge of the domain ontology. For instance, if there is a predicate cited_in_document I could rank a document higher if it is often cited. But I'm not sure about the results... I am very interested in your approach. You can see a PageRank-like method used in Swoogle, but the relations they used are only some simple ones, such as import (used in OWL files). If we can use the semantic-level relations, it should be better. But I am not sure it can succeed, as it requires deciding how to weight the relations. Klaus -- Regards Jiang Xing
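Klaus's idea can be sketched as a PageRank-style iteration where each edge carries a weight derived from its RDF predicate (e.g. cited_in_document counting more than a generic link). Everything below, the matrix encoding, the weights, and the tiny graph, is made up for illustration:

```java
import java.util.Arrays;

// Weighted PageRank sketch: rank'(v) = (1-d)/N + d * sum over incoming
// edges (u -> v, weight w) of rank(u) * w / totalOutWeight(u).
public class WeightedRank {
    public static double[] rank(double[][] w, int iterations, double d) {
        int n = w.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start uniform
        double[] outWeight = new double[n];
        for (int u = 0; u < n; u++)
            for (int v = 0; v < n; v++) outWeight[u] += w[u][v];
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n); // teleport term
            for (int u = 0; u < n; u++) {
                if (outWeight[u] == 0) continue; // dangling node: drop its mass
                for (int v = 0; v < n; v++) {
                    next[v] += d * rank[u] * w[u][v] / outWeight[u];
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // doc0 is cited (weight 2.0) by doc1 and doc2; doc1 and doc2 are
        // only weakly linked to each other (weight 0.5).
        double[][] w = {
            {0.0, 0.0, 0.0},
            {2.0, 0.0, 0.5},
            {2.0, 0.5, 0.0},
        };
        double[] r = rank(w, 50, 0.85);
        System.out.println(r[0] > r[1] && r[0] > r[2]); // doc0 ranks highest
    }
}
```

The open question from the thread remains exactly where the edge weights come from: hand-assigned per predicate, learned from a corpus, or derived from the ontology structure.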
Re: Using Lucene for searching in the Semantic Web.
Hi, I have done a small survey of information retrieval on the Semantic Web (maybe I have missed many papers; most papers I used were published in recent WWW and CIKM conferences :). 1. A typical way of using the ontology is to select exact terms from the domain ontology to form queries. The first one may be OntoSeek (www.loa-cnr.it/Papers/OntoSeek.pdf). Similar work may be Latifur Khan's (Retrieval effectiveness of an ontology-based model for information selection, VLDB 2004). 2. Guha et al. (Semantic Search, WWW2003) used a domain ontology to form a concept graph. Then users only need to browse the generated concept graph. Similar work may be Eero Hyvonen's MuseumFinland. They all used the semantic structure of the domain ontology to help users browse. 3. QuizRDF (www.cs.rutgers.edu/~shklar/www11/final_submissions/paper6.pdf) used another kind of method for using the domain ontology. Klaus, I think your method should be better than QuizRDF's. One interesting method I found is Rocha's work (A hybrid approach for searching in the Semantic Web, WWW2004). They still used a keyword-based method for retrieving documents on the Semantic Web, but I cannot find any more information about their work, and the application I am building can be seen as an extension of it. Actually, Swoogle focuses on the ontology-level files only: it crawls RDF, OWL and DAML files, but they do not provide any new method that combines the traditional keyword method for searching text files. Li Ding used a variant of the PageRank method for ontology files, but I am not sure this method can be combined with the PageRank method for text. Maybe I have missed too many things in this survey. However, I think we may find some good new methods of using the domain ontology in the Semantic Web.
Yours truly, Jiang Xing On 1/19/06, Klaus [EMAIL PROTECTED] wrote: Hello, Hi, I think one problem of the existing methods is that, to query RDF files or similar structures, we have to form SQL-like queries. However, for searching text files, we only need to type several keywords. Can we combine the two methods, and how? For instance, I would only need to enter some keywords. Yes, you are right. At the moment I offer the users a UI where they can input some keywords and, in addition, an RQL-like query via drop-down menus. With the help of this semantic query they can demarcate the result set, e.g. saying that all results should belong to one class, or deal with one theme. Now I am trying to automate the generation of the query... but I'm not sure how to do this exactly. Maybe I will use some kind of pseudo relevance feedback to do some semantic analysis on the first result set. Why do we have to learn some SQL-like language for searching the Semantic Web? Maybe this paper can help you... Primarily, the Semantic Web is for agents and so on, not for humans, so the information has to have a structure which can be exploited. http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21 By the way, maybe you should take a look at http://swoogle.umbc.edu/ There is also quite a big number of papers on scholar.google. Do you have any ideas right now? Peace Klaus -- Regards Jiang Xing
Re: Using Lucene for searching in the Semantic Web.
Hi, I think one problem of the existing methods is that, to query RDF files or similar structures, we have to form SQL-like queries. However, for searching text files, we only need to type several keywords. Can we combine the two methods, and how? For instance, I would only need to enter some keywords; then the system can handle the rest of the process. Why do we have to learn some SQL-like language for searching the Semantic Web? regards Jiang Xing On 1/18/06, Klaus [EMAIL PROTECTED] wrote: Hi Jiang, I'm currently facing a similar problem. Up to now I have to use a graph-matching algorithm for the semantic query, but the full-text search in the Semantic Web is performed by Lucene. At first I wrote the whole text into one index. The document contains one field for the unique id and one for the whole text. For the semantic markup I use an extra index: every RDF triple results in a document with the following fields: id, and predicate + subject + object. Every query is executed on both indexes. I use an extra index for the RDF data because this results in a higher score for the documents. You might argue that this would adulterate the result, but from my point of view explicit metadata should be scored higher than terms in the document body. Cheers, Klaus -Original message- From: jason [mailto:[EMAIL PROTECTED]] Sent: Tuesday, 17 January 2006 15:35 To: java-user@lucene.apache.org Subject: Using Lucene for searching in the Semantic Web. Hi friends, What do you think about using Lucene for searching the Semantic Web? I am trying to use Lucene for searching documents with ontological annotations, but I do not have a good model to combine the keyword information and the ontological information. regards jiang xing -- Regards Jiang Xing
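Klaus's second index, one document per RDF triple, can be sketched with a plain Map standing in for a Lucene Document; the field names follow his description, while the example id and URIs are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// One RDF triple -> one "document" of named fields, ready to be mirrored
// into a Lucene Document (one Field per map entry).
public class TripleDoc {
    public static Map<String, String> toDoc(String id, String subject,
                                            String predicate, String object) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("subject", subject);
        doc.put("predicate", predicate);
        doc.put("object", object);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> d = toDoc("doc42",                       // hypothetical id
                "http://example.org/paper42",                        // hypothetical subject URI
                "dc:creator",
                "Jiang Xing");
        System.out.println(d.get("predicate")); // dc:creator
    }
}
```

Keeping the id field shared between the text index and the triple index is what lets a combined query join the two result lists back onto the same documents.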