Enis, thanks for the excellent answer! I have added a few inline notes and rough sketches below, mostly to check my own understanding.

Lukas
On 8/13/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> Hi,

> Lukas Vlcek wrote:
> > Enis,
> >
> > Thanks for your time.
> > I gave a quick glance at Pig and it seems good (it seems to be directly based on Hadoop, which I am starting to play with :-). It is obvious that a huge amount of data (like user queries or access logs) should be stored in flat files, which makes it convenient for further analysis by Pig (or directly by Hadoop-based tasks) or other tools. And I agree with you that the size of the index can be tracked in a journal-based style in a separate log rather than with every single user query. So much for the easier part of my original question :-)
> >
> > The true art starts with the mining tasks themselves. How to efficiently use such data to improve the user experience with the search engine... One option is to use such data for search engine tuning, which is a more technically oriented application (speeding up slow queries, reshaping the index, ...). But I am looking for a more IR-oriented application of this information. I remember that I once read on the Lucene mailing list that somebody suggested using previously issued user queries for suggesting similar/other/related queries or for typo checking. While there are well-discussed methods (moreLikeThis, did-you-mean, similarity, etc.) in the Lucene community, I am still wondering whether one can use user search history data for such a purpose, and if the answer is yes, then how (practical examples are welcome).

> Well, the logs of a search engine are used to improve the search engine in several ways. All the major search engines, including Google, Yahoo and MS, use the logs to deliver more relevant results to the user. How they do this is a big research area. Although Google tends not to publish its algorithms, Yahoo and MS do.

> The logs can be used to improve various components of the search engine, for example the suggesters (Google Suggest, Yahoo Suggest) and the spell checkers. Most of the spell checkers use the noisy channel error model. In this model, the user emits word w with probability p(w), but due to the error in the channel (that is, misspelling) w' is emitted instead. Spelling correction deals with correcting w' to w. The calculation involves edit distance, prior probabilities of words, probabilities of errors (the probability of writing "spel" instead of "spell") and a predefined lexicon. But in the world of the internet there is no lexicon, so given a word w' you may not know exactly whether w' is misspelled or not. The logs can be used for calculating the probabilities, building the lexicon, and generating possible suggestions (in case of a misspelled word).

> For spell checking you can consult:
> An Improved Error Model for Noisy Channel Spelling Correction, Brill et al.
> Learning a Spelling Error Model from Search Query Logs, Ahmad et al.
> Spelling Correction for Search Engine Queries, Martins et al.
> Techniques for Automatically Correcting Words in Text, Kukich (an excellent review of the topic)
> Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users, Brill et al. (this MS paper is very good)
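Just to check that I follow the noisy channel idea above, here is how I picture it in code. This is only my reading of your description, not anything from the papers: the lexicon and the term counts would come from the query logs, and the error model is crudely faked with edit distance (a fixed 0.05 penalty per edit is a number I made up). The class and method names are mine as well:

    import java.util.Map;

    // Rough sketch only: pick the lexicon word w maximizing p(w) * p(observed|w).
    // p(w) comes from query-log term counts; p(observed|w) is crudely
    // approximated from the edit distance (a real error model would be
    // learned from the logs, as the papers above describe).
    public class NoisyChannelSketch {

        private final Map<String, Integer> termCounts; // lexicon built from the logs
        private long total = 0;

        public NoisyChannelSketch(Map<String, Integer> termCounts) {
            this.termCounts = termCounts;
            for (int c : termCounts.values()) {
                total += c;
            }
        }

        public String correct(String observed) {
            String best = observed;
            double bestScore = 0.0;
            for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
                double prior = e.getValue() / (double) total;                         // p(w)
                double channel = Math.pow(0.05, editDistance(observed, e.getKey()));  // ~p(w'|w)
                double score = prior * channel;
                if (score > bestScore) {
                    bestScore = score;
                    best = e.getKey();
                }
            }
            return best;
        }

        // plain Levenshtein distance
        private static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }
    }

Obviously naive (it scans the whole lexicon and will happily "correct" words it simply has not seen), but it shows me where the log data plugs in: the counts give p(w), and logged misspelling/rewrite pairs would give a real p(w'|w).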
> For query suggestion, Yahoo has a paper about using machine learning methods on the log data. The basic idea is that the user submits a query and then, if not satisfied with the results, refines the query and resubmits it. Looking at all the collective data we can find possibly better suggestions for a submitted query, classify them as useful, not useful or irrelevant, and then display the useful ones. You can find the paper at the Yahoo research site.

> Using the web log for improving search engine quality is a much harder problem. Unfortunately, I could not find time to read more about this topic yet. I know that MS uses a method based on neural nets, called RankNet, which ranks the search results. The net is trained with the server logs. Below I list some papers, but I have not read all of them so I cannot say anything further about them:
> Accurately Interpreting Clickthrough Data as Implicit Feedback
> A Simulated Study of Implicit Feedback Models
> Identifying "Best Bet" Web Search Results by Mining Past User Behavior
> Improving Web Search Ranking by Incorporating User Behavior Information
> Learning User Interaction Models for Predicting Web Search Result Preferences
> Optimizing Search Engines Using Clickthrough Data
> Query Chains: Learning to Rank from Implicit Feedback

> > Lukas
> >
> > On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> >> Lukas Vlcek wrote:
> >>> Hi Enis,
> >>
> >> Hi again,
> >>
> >>> On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> >>>> Hi,
> >>>>
> >>>> Lukas Vlcek wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I would like to keep user search history data and I am looking for some ideas/advice/recommendations. In general I would like to talk about methods of storing such data, its structure and how to turn it into valuable information.
> >>>>>
> >>>>> As for the structure:
> >>>>> ==============
> >>>>> For now I don't have an exact idea about what kind of information I should keep. I know that this is application specific but I believe there can be some common general patterns. As of now I think it can be useful to keep the following:
> >>>>> 1) system time (time of issuing the query) and userid
> >>>>> 2) original user query in raw form (untokenized)
> >>>>> 3) expanded user query (both tokenized and untokenized can be useful)
> >>>>> 4) query execution time
> >>>>> 5) # of objects retrieved from index
> >>>>> 6) # of total object count in index (this can change over time)
> >>>>> 7) and possibly whether the user clicked some result, and if so which one (the hit number) and the system time
> >>>>
> >>>> Remember that you may not want to store all the information available at runtime of the query, since it may result in a great performance burden. For example you may want to store the raw form of the query, but not the parsed form, since you can parse the query later anyway (unless you have some architecture change). Similarly, 6 does not seem a good choice to me (again, you can store the info externally). You can look at the common and extended log formats which are stored by the web server.
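Coming back to this point now: to make the log format concrete for myself, I imagine one tab-separated line per query, something like the sketch below. The field names are mine and nothing is settled; the raw query stays unparsed, as you suggest, and the clicked rank is -1 when there was no click. (In reality the click arrives after the query, so it may have to become a second record type joined on a query id; I am ignoring that for now.)

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Minimal sketch of the per-query log line I have in mind (field names
    // are mine): one tab-separated line per query, raw query kept unparsed.
    public class QueryLogWriter {

        private final PrintWriter out;

        public QueryLogWriter(String path) throws IOException {
            this.out = new PrintWriter(new FileWriter(path, true), true); // append, autoflush
        }

        public void log(String userId, String rawQuery, long execMillis,
                        int hitCount, int clickedRank) {
            // timestamp \t userId \t rawQuery \t execMillis \t hitCount \t clickedRank
            out.printf("%d\t%s\t%s\t%d\t%d\t%d%n",
                    System.currentTimeMillis(), userId,
                    rawQuery.replace('\t', ' '),   // keep the line parseable
                    execMillis, hitCount, clickedRank);
        }

        public void close() {
            out.close();
        }
    }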
> >>> The problem is that all this information does change over time. The index is updated continuously, which means that the expanded queries and the total number of documents in the index change as well. But you are right that getting some of this info can cause extra performance expenses (then it would be a question of later optimization of the architecture design).
> >>
> >> Well, I think you can at least store the size of the index in another file, and log the changes in the index size there. The motivation for this comes from storage efficiency. You may not want to store the same index size over and over again in the n queries before the index size changes, but store it once, with the time, per change.
> >>
> >>>>> As for the information I can get from this:
> >>>>> =============================
> >>>>> Such minimal data collection could show whether the search engine serves users well or not (generally speaking). I should note that for the users in this case the only other option is to not use the search engine at all (so the data should not be biased by users using an alternative search method). I should be able to learn whether:
> >>>>> 1) there are bottleneck queries (Prefix, Fuzzy, Proximity queries, ...)
> >>>>> 2) users are finding what they want (they can find it fast and results are ordered by properly defined relevance [my model is well tuned in terms of term weights], so the result they click is among the first hits)
> >>>>> 3) users can formulate queries well (do they issue queries which return all documents in the index, or can they issue queries which return just a couple of documents)
> >>>>> 4) ...?... etc...
> >>>>
> >>>> Web server log analysis is a very popular topic nowadays, and you can check the literature, especially clickthrough data analysis. All the major search engines have to interpret the data to improve their algorithms, and to learn from the latent "collective knowledge" hidden in web server logs.
> >>>
> >>> It seems I have to do my homework and check CiteSeer for some papers :-) Is there any paper you can recommend? Some good one to start with? What I want to achieve is far beyond the scope of the project I am working on right now, thus I cannot spend all my time on research (in spite of the fact I would love to), so I can either a) use some tool which is already available (open sourced) and directly fits my needs (I don't think there is any tool which I could use out of the box) or b) implement something new from scratch but with just very limited functionality.
> >>
> >> You do not have to implement this from scratch. You just have to specify your data mining tasks, then write scripts (in Pig Latin) or write map-reduce programs (in Hadoop). Neither of these is that hard. I do not think there is any tool which would satisfy all your information needs. So at the risk of repeating myself, I suggest you look at Pig and write some scripts to mine the data.
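Point taken, I will look at Pig. Just to convince myself the map-reduce route is not scary either: counting how often each raw query shows up in the tab-separated log sketched above would look roughly like the classic word-count example. I wrote this from memory against the old org.apache.hadoop.mapred API, so treat the exact class and method names as approximate (I have not compiled it), and QueryCount plus the field positions are of course my own invention:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Count how many times each raw query appears in the tab-separated log.
    public class QueryCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            public void map(LongWritable key, Text line,
                            OutputCollector<Text, LongWritable> out, Reporter reporter)
                    throws IOException {
                String[] fields = line.toString().split("\t");
                if (fields.length >= 3) {
                    out.collect(new Text(fields[2]), ONE);  // field 2 = raw query
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, LongWritable, Text, LongWritable> {
            public void reduce(Text query, Iterator<LongWritable> counts,
                               OutputCollector<Text, LongWritable> out, Reporter reporter)
                    throws IOException {
                long sum = 0;
                while (counts.hasNext()) sum += counts.next().get();
                out.collect(query, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(QueryCount.class);
            conf.setJobName("querycount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));   // log directory
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // result directory
            JobClient.runJob(conf);
        }
    }

In Pig Latin this should collapse to a LOAD / GROUP / COUNT script of a few lines, which I guess is exactly your point.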
> >> Coming to the literature, I can hardly suggest any specific paper, since I am not very into the subject either. But I suggest you skip this step: first build your data structures (log format), then start extracting some naive statistical information from the data. For example, initially you may want to know
> >> 1. average query execution time
> >> 2. average query execution time per query type (boolean, fuzzy, etc.)
> >> 3. histogram of query types (how many boolean queries, etc.)
> >> 4. average # of queries per user session
> >> 5. etc.
> >>
> >> The list can go on and on depending on the data you have and the information you want. These kinds of simple statistical analyses can be very easy to extract and relatively easy to interpret.
> >>
> >>>>> As for the storage method:
> >>>>> ===================
> >>>>> I was planning to keep such data in a database but now it seems to me that it will be better to keep it directly in an index (a Lucene index). It seems to me that this approach would allow me better fuzzy searches across the history and extracting relevant objects and their counts more efficiently (with the benefit of relevance-based search on top of the history search corpus).
> >>>>> I think a more scalable solution would be to keep such data in a pure flat file and then periodically recreate the search history index (or more indices) from it (for example by a Map-Reduce-like task). Even better, the flat file could be stored in a distributed file system. However, for now I would like to start with something simple.
> >>>>
> >>>> I would rather suggest you keep the logs in rolling flat files. An access to the database for each search will take a lot of time. Then you may want to flush those logs to the db once a day, if you indeed want to store the data in a relational way.
> >>>>
> >>>> I infer that you want to mine the data, but you do not know what to mine, right? I suggest you look at Hadoop and Pig. Pig is designed especially for this purpose.
> >>>
> >>> You've hit the nail on the head! I am very curious about how one can use such data to improve the user experience with a search engine (given my project schedule time constraints).
> >>>
> >>>> I know this is a complex topic...
> >>>>
> >>>>> Regards,
> >>>>> Lukas
> >>>
> >>> Anyway, thanks for your reply!
> >>>
> >>> BR
> >>> Lukas
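One last note, regarding the naive statistics above (average execution time and friends): once the flat log exists, even a throwaway single-machine pass over it gives a first cut. A sketch over the same made-up tab-separated format as before, so the field positions and the class name are mine; a per-query-type breakdown would need one extra logged field for the parsed query class:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Throwaway single-machine pass over the tab-separated query log sketched
    // earlier: average execution time, average hit count, and how often the
    // user clicked anything at all (clickedRank >= 0).
    public class NaiveLogStats {

        public static void main(String[] args) throws IOException {
            long queries = 0, clicks = 0, totalMillis = 0, totalHits = 0;

            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (f.length < 6) continue;                 // skip malformed lines
                queries++;
                totalMillis += Long.parseLong(f[3]);        // execMillis
                totalHits += Long.parseLong(f[4]);          // hitCount
                if (Integer.parseInt(f[5]) >= 0) clicks++;  // clickedRank
            }
            in.close();

            System.out.println("queries:              " + queries);
            System.out.println("avg exec time (ms):   " + (queries == 0 ? 0 : totalMillis / (double) queries));
            System.out.println("avg hits per query:   " + (queries == 0 ? 0 : totalHits / (double) queries));
            System.out.println("queries with a click: " + clicks);
        }
    }

Once the volume grows, the same logic can move into the map-reduce job (or a Pig script). Thanks again for all the pointers.

Lukas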