Enis,
thanks for the excellent answer!
Lukas

On 8/13/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Lukas Vlcek wrote:
> > Enis,
> >
> > Thanks for your time.
> > I gave a quick glance at Pig and it seems good (it appears to be built
> > directly on Hadoop, which I am starting to play with :-). It is obvious
> > that a huge amount of data (like user queries or access logs) should be
> > stored in flat files, which makes it convenient for further analysis by
> > Pig (or directly by Hadoop based tasks) or other tools. And I agree with
> > you that the size of the index can be tracked in a journal-based style in
> > a separate log rather than with every single user query. That is the
> > easier part of my original question :-)
> >
> > The true art starts with the mining tasks themselves. How does one
> > efficiently use such data to improve the user experience with the search
> > engine... One option is to use such data for search engine tuning, which
> > is a more technically oriented application (speeding up slow queries,
> > reshaping the index, ...). But I am looking for a more IR oriented
> > application of this information. I remember reading on the Lucene mailing
> > list that somebody suggested using previously issued user queries to
> > suggest similar/other/related queries or for typo checking. While there
> > are well discussed methods (moreLikeThis, did you mean, similarity, ...
> > etc.) in the Lucene community, I am still wondering whether one can use
> > user search history data for such a purpose and, if the answer is yes,
> > then how (practical examples are welcome).
> >
> >
> Well, the logs of a search engine are used to improve the search engine
> in several ways. All the major search engines, including Google, Yahoo
> and MS, use their logs to deliver more relevant results to the user. How
> they do this is a big research area. Although Google tends not to
> publish its algorithms, Yahoo and MS do.
>
> The logs can be used to improve various components of the search engine,
> for example the suggesters (Google Suggest, Yahoo Suggest) and the
> spell checkers. Most spell checkers use the noisy channel error
> model. In this model, the user intends word w with probability p(w), but
> due to the error in the channel (that is, misspelling) the user emits w'
> instead. Spelling correction deals with correcting w' back to w. The
> calculation involves edit distance, prior probabilities of words,
> probabilities of errors (e.g. the probability of writing "spel" instead
> of "spell"), and a predefined lexicon. But in the world of the internet
> there is no lexicon, so given a word w' you may not know exactly whether
> w' is misspelled or not. The logs can be used for calculating the
> probabilities, building the lexicon, and generating possible suggestions
> (in the case of a misspelled word).
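
A minimal sketch of the noisy channel scoring described above, picking the
candidate w that maximizes p(w) * p(w'|w). Everything here is invented for
illustration: the prior p(w) and the lexicon are assumed to have been mined
from the query logs, and the error model is crudely approximated by edit
distance.

    // Hypothetical noisy channel corrector: argmax over candidates of
    // log p(w) + log p(w'|w), with p(w'|w) approximated via edit distance.
    import java.util.Map;

    public class NoisyChannelCorrector {

        private final Map<String, Double> prior; // p(w), estimated from the logs

        public NoisyChannelCorrector(Map<String, Double> prior) {
            this.prior = prior;
        }

        /** Most likely intended word for an observed (possibly misspelled) word. */
        public String correct(String observed) {
            String best = observed;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Double> e : prior.entrySet()) {
                String candidate = e.getKey();
                // crude error model: p(w'|w) decays with edit distance
                double errorLogProb = -2.0 * editDistance(observed, candidate);
                double score = Math.log(e.getValue()) + errorLogProb;
                if (score > bestScore) {
                    bestScore = score;
                    best = candidate;
                }
            }
            return best;
        }

        /** Standard Levenshtein edit distance. */
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }
    }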
>
> For spell checking you can consult:
>
> An Improved Error Model for Noisy Channel Spelling Correction, Brill et al.
> Learning a Spelling Error Model from Search Query Logs, Ahmad et al.
> Spelling Correction for Search Engine Queries, Martins et al.
> Techniques for Automatically Correcting Words in Text, Kukich (an
> excellent review of the topic)
> Spelling Correction as an Iterative Process that Exploits the Collective
> Knowledge of Web Users, Brill et al. (this MS paper is very good)
>
> For query suggestion, Yahoo has a paper about using machine learning
> methods on the log data. The basic idea behind it is that the user
> submits a query and then, if not satisfied with the results, refines the
> query and resubmits it. Looking at all the collective data we can find
> possible better suggestions for a submitted query, classify them as
> useful, not useful or irrelevant, and then display the useful ones. You
> can find the paper at the Yahoo research site.
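
One way to read "refines the query and resubmits it" out of the logs is to
count consecutive query pairs within a session; frequent pairs become
candidate suggestions. A rough sketch, assuming the log records have already
been grouped by session and sorted by time (the class and method names are
made up):

    // Count (query -> refined query) pairs observed within user sessions.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ReformulationCounter {

        /** sessions: for each session, its queries in the order they were issued. */
        public static Map<String, Integer> countPairs(List<List<String>> sessions) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (List<String> queries : sessions) {
                for (int i = 0; i + 1 < queries.size(); i++) {
                    // each consecutive pair is treated as a candidate reformulation
                    String pair = queries.get(i) + " -> " + queries.get(i + 1);
                    Integer c = counts.get(pair);
                    counts.put(pair, c == null ? 1 : c + 1);
                }
            }
            return counts; // frequent pairs are candidate "related query" suggestions
        }
    }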
>
> Using the web logs to improve search result quality is a much harder
> problem. Unfortunately, I have not yet found the time to read more about
> this topic. I know that MS uses a method based on neural nets, called
> RankNet, which ranks the search results; the net is trained with the
> server logs. Below I list some papers, but I have not read all of them,
> so I cannot say anything further about them:
>
> Accurately Interpreting Clickthrough Data as Implicit Feedback
> A Simulated Study of Implicit Feedback Models
> Identifying "Best Bet" Web Search Results by Mining Past User Behavior
> Improving Web Search Ranking by Incorporating User Behavior Information
> Learning User Interaction Models for Predicting Web Search Result
> Preferences
> Optimizing Search Engines Using Clickthrough Data
> Query Chains: Learning to Rank from Implicit Feedback
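
A common first step in the clickthrough work listed above is to turn raw
clicks into pairwise preferences: a clicked result is taken to be better than
every unclicked result ranked above it. A minimal sketch of that heuristic
(the names are made up; real systems add many corrections for position bias):

    // Turn a result list plus the set of clicked documents into
    // "clicked > skipped-above" preference pairs for training a ranker.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class ClickPreferences {

        /**
         * results: document ids in the order they were shown.
         * clicked: ids the user actually clicked.
         */
        public static List<String> extract(List<String> results, Set<String> clicked) {
            List<String> prefs = new ArrayList<String>();
            for (int i = 0; i < results.size(); i++) {
                if (!clicked.contains(results.get(i))) {
                    continue;
                }
                // every unclicked result ranked above a clicked one is assumed worse
                for (int j = 0; j < i; j++) {
                    if (!clicked.contains(results.get(j))) {
                        prefs.add(results.get(i) + " > " + results.get(j));
                    }
                }
            }
            return prefs; // training pairs for a ranker such as RankNet
        }
    }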
>
>
>
>
> > Lukas
> >
> > On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Lukas Vlcek wrote:
> >>
> >>> Hi Enis,
> >>>
> >>>
> >> Hi again,
> >>
> >>> On 8/10/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> >>>
> >>>
> >>>> Hi,
> >>>>
> >>>> Lukas Vlcek wrote:
> >>>>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I would like to keep user search history data and I am looking for
> >>>>> some ideas/advice/recommendations. In general I would like to talk
> >>>>> about methods of storing such data, its structure and how to turn it
> >>>>> into valuable information.
> >>>>>
> >>>>> As for the structure:
> >>>>> ==============
> >>>>> For now I don't have an exact idea about what kind of information I
> >>>>> should keep. I know that this is application specific, but I believe
> >>>>> there can be some common general patterns. As of now, I think it can
> >>>>> be useful to keep the following:
> >>>>>
> >>>>> 1) system time (time of issuing the query) and userid
> >>>>> 2) original user query in raw form (untokenized)
> >>>>> 3) expanded user query (both tokenized and untokenized can be useful)
> >>>>> 4) query execution time
> >>>>> 5) # of objects retrieved from index
> >>>>> 6) total # of objects in the index (this can change over time)
> >>>>> 7) and possibly whether the user clicked some result and, if so,
> >>>>>    which one (the hit number) and the system time
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> Remember that you may not want to store all the information available
> >>>> at query time, since doing so may impose a significant performance
> >>>> burden. For example, you may want to store the raw form of the query,
> >>>> but not the parsed form, since you can always parse the query later
> >>>> (unless you have some architecture change). Similarly, item 6 does not
> >>>> seem like a good choice to me (again, you can store that info
> >>>> externally). You can look at the common and extended log formats that
> >>>> web servers store.
> >>>>
> >>>>
> >>> The problem is that all this information does change over time. The
> >>> index is updated continuously, which means that expanded queries and
> >>> the total number of documents in the index change as well. But you are
> >>> right that getting some of this info can cause extra performance
> >>> expense (then it would be a question of later optimization of the
> >>> architecture design).
> >>>
> >>>
> >> Well, I think you can at least store the size of the index in another
> >> file, and log the changes in the index size there. The motivation for
> >> this comes from storage efficiency. You may not want to store the same
> >> index size over and over again for the n queries issued before the
> >> index size changes, but rather store it once, with the time, per change.
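
A tiny sketch of that change-only journal (the file name and record layout
are invented for illustration):

    // Append the index size to a journal only when it actually changes,
    // instead of logging the same value with every query.
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class IndexSizeJournal {

        private long lastSize = -1;

        public synchronized void record(long currentSize) throws IOException {
            if (currentSize == lastSize) {
                return; // nothing changed, nothing to log
            }
            lastSize = currentSize;
            PrintWriter out = new PrintWriter(new FileWriter("index-size.log", true));
            try {
                // one line per change: timestamp <tab> index size
                out.println(System.currentTimeMillis() + "\t" + currentSize);
            } finally {
                out.close();
            }
        }
    }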
> >>
> >>>
> >>>>> As for the information I can get from this:
> >>>>> =============================
> >>>>> Such a minimal data collection could show whether the search engine
> >>>>> serves users well or not (generally speaking). I should note that for
> >>>>> the users in this case the only other option is not to use the search
> >>>>> engine at all (so the data should not be biased by the fact that users
> >>>>> are using an alternative search method). I should be able to learn
> >>>>> whether:
> >>>>>
> >>>>> 1) there are bottleneck queries (Prefix, Fuzzy, Proximity queries...)
> >>>>> 2) users are finding what they want (they can find it fast and the
> >>>>>    results are ordered by properly defined relevance [my model is well
> >>>>>    tuned in terms of term weights], so the result they click is among
> >>>>>    the first hits)
> >>>>> 3) users can formulate queries well (do they issue queries which
> >>>>>    return all documents in the index, or can they issue queries which
> >>>>>    return just a couple of documents)
> >>>>> 4) ...?... etc...
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> Web server log analysis is a very popular topic nowadays, and you can
> >>>> check the literature, especially on clickthrough data analysis. All
> >>>> the major search engines have to interpret this data to improve their
> >>>> algorithms, and to learn from the latent "collective knowledge" hidden
> >>>> in web server logs.
> >>>>
> >>>>
> >>> It seems I have to do my homework and check CiteSeer for some papers
> >>> :-) Is there any paper you can recommend to me? A good one to start
> >>> with? What I want to achieve is far beyond the scope of the project I
> >>> am working on right now, thus I cannot spend all my time on research
> >>> (in spite of the fact that I would love to), so I can either a) use
> >>> some tool which is already available (open sourced) and directly fits
> >>> my needs (I don't think there is any tool I could use out of the box),
> >>> or b) implement something new from scratch but with just very limited
> >>> functionality.
> >>>
> >>>
> >> You do not have to implement this from scratch. You just have to
> >> specify your data mining tasks, then write scripts (in Pig Latin) or
> >> write map-reduce programs (in Hadoop). Neither of these is that hard. I
> >> do not think there is any tool which will satisfy all your information
> >> needs. So, at the risk of repeating myself, I suggest you look at Pig
> >> and write some scripts to mine the data.
> >>
> >> Coming to the literature, I can hardly suggest any specific paper,
> >> since I am not very into the subject either. But I suggest you skip
> >> this step: first build your data structures (log format), then start
> >> extracting some naive statistical information from the data. For
> >> example, initially you may want to know
> >> 1. average query execution time
> >> 2. average query execution time per query type (boolean, fuzzy, etc.)
> >> 3. histogram of query types (how many boolean queries, etc.)
> >> 4. average # of queries per user session
> >> 5. etc.
> >>
> >> The list can go on and on depending on the data you have and the
> >> information you want. These kinds of simple statistics can be very easy
> >> to extract and relatively easy to interpret.
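
As an illustration of item 2 in the quoted list, here is a rough Hadoop
map-reduce sketch. The log layout is invented for the example (tab-separated:
timestamp, userid, query type, raw query, execution time in ms), so the field
indices are assumptions:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Average query execution time per query type, using the old mapred API.
    public class AvgTimePerQueryType {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, LongWritable> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, LongWritable> out, Reporter reporter)
                    throws IOException {
                String[] f = value.toString().split("\t");
                // emit (queryType, execMillis)
                out.collect(new Text(f[2]), new LongWritable(Long.parseLong(f[4])));
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, LongWritable, Text, LongWritable> {
            public void reduce(Text key, Iterator<LongWritable> values,
                               OutputCollector<Text, LongWritable> out, Reporter reporter)
                    throws IOException {
                long sum = 0, count = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                    count++;
                }
                out.collect(key, new LongWritable(count == 0 ? 0 : sum / count));
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(AvgTimePerQueryType.class);
            conf.setJobName("avg-time-per-query-type");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }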
> >>
> >>
> >>>>> As for the storage method:
> >>>>> ===================
> >>>>> I was planning to keep such data in a database, but now it seems to
> >>>>> me that it will be better to keep it directly in an index (a Lucene
> >>>>> index). It seems to me that this approach would allow me better fuzzy
> >>>>> searches across the history and more efficient extraction of relevant
> >>>>> objects and their counts (with the benefit of relevance based search
> >>>>> on top of the history corpus). I think a more scalable solution would
> >>>>> be to keep such data in a pure flat file and then periodically
> >>>>> recreate the search history index (or indices) from it (for example
> >>>>> by a Map-Reduce like task). Even better, the flat file could be
> >>>>> stored in a distributed file system. However, for now I would like to
> >>>>> start with something simple.
> >>>>>
> >>>>>
> >>>>>
> >>>> I would rather suggest you keep the logs in rolling flat files; a
> >>>> database access for each search would take a lot of time. Then you
> >>>> may want to flush those logs to the db once a day if you indeed want
> >>>> to store the data in a relational way.
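
A tiny sketch of such a per-query log record, rolled into a new file each
day (the field layout and file naming are invented for the example):

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    // Appends one tab-separated record per query to a per-day file, so the
    // logs roll daily and can be bulk-loaded into a database later if needed.
    public class QueryLogWriter {

        public synchronized void log(String userId, String rawQuery,
                                     long execMillis, int hits) throws IOException {
            String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
            PrintWriter out =
                    new PrintWriter(new FileWriter("query-" + day + ".log", true));
            try {
                // timestamp, userid, raw query, execution time (ms), hit count
                out.println(System.currentTimeMillis() + "\t" + userId + "\t"
                        + rawQuery + "\t" + execMillis + "\t" + hits);
            } finally {
                out.close();
            }
        }
    }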
> >>>>
> >>>> I infer that you want to mine the data, but you do not know what to
> >>>> mine, right? I suggest you look at Hadoop and Pig. Pig is designed
> >>>> especially for this purpose.
> >>>>
> >>>>
> >>> You've hit the nail on the head! I am very curious about how one can
> >>> use such data to improve the user experience with a search engine
> >>> (given my project schedule time constraints).
> >>>
> >>>
> >>>
> >>>> I know this is a complex topic...
> >>>>
> >>>>
> >>>>> Regards,
> >>>>> Lukas
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>> Anyway, thanks for your reply!
> >>>
> >>> BR
> >>> Lukas
> >>>
> >>>
> >>>
> >
> >
>
