On Nov 19, 2007, at 3:41 PM, Chris Hostetter wrote:


: info, etc. could be stripped fairly easily. So, we wouldn't necessarily
: know who is searching for "Yonik Seeley" when we see that query term,
: just that it was searched for. Maybe we can inquire to infrastructure
: what is even

It's a largely theoretical argument (particularly relating to a subset of
results on a specific domain, as opposed to a subset from a specific
search engine), but the nutshell is: there may in fact be identifiable
info in the query string itself, so it's good to have some sanity
checking before exposing the queries to the world.

Agreed.



: At any rate, I think the bigger issue is finding a good set of data and
: query logs that we can use. An alternate way is to just start creating
: a query set based on the Wikipedia data, but that isn't as "real world"
: as query logs are.

I think looking at referrer URLs containing query strings, grouped by
TLP (top-level project) site, would give us lots of useful "small"
collections of docs and query strings that are considered "relevant"
(albeit not by a human judgement, but by some other search engine -- it's
a start).

If you take something like the online HTTPD manual, each URL can be
easily mapped to a machine-parsable XML version, and I'm sure we can find
plenty of good query strings in the referrer logs for httpd.apache.org.
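
For illustration, here's a rough Python sketch of that extraction,
assuming the standard Apache "combined" log format; the log file name is
made up, and the list of query parameter names is a guess to extend as
needed:

import re
from urllib.parse import urlparse, parse_qs

# Query-parameter names commonly used by search engines (an assumption:
# Google/MSN use "q", Yahoo uses "p"; extend as needed).
QUERY_PARAMS = ("q", "query", "p")

# Matches the tail of the Apache "combined" log format:
#   "%r" %>s %b "%{Referer}i" ...
REFERER_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<ref>[^"]*)"')

def queries_from_log(path):
    """Yield search-query strings found in the Referer field of a log."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = REFERER_RE.search(line)
            if not m or m.group("ref") in ("", "-"):
                continue
            qs = parse_qs(urlparse(m.group("ref")).query)
            for param in QUERY_PARAMS:
                if param in qs:
                    yield qs[param][0]
                    break

# Usage (hypothetical file name):
#   for q in queries_from_log("httpd.apache.org-access.log"):
#       print(q)

Grouping the yielded strings by the URL path that was hit would then give
the per-collection query sets.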


+1


: Here's another possible thought: What if we took our own java-user
: mailing list for a time period and we used the subject line or some
: other piece of info in the text (maybe we can automatically identify
: questions (not hard to do for simple cases (just identify sentences
: ending in ?)), which would give us enough, methinks) and treat them as
: queries?  This may be a decent

Two concerns I would have:
 1) the person asking the question doesn't always know what to ask about
    (the X/Y problem), which could lead to misleading query/result
    matches.
 2) people aren't always "on topic" ... discussions can branch/evolve
    without subjects changing (formal documentation doesn't really have
    this problem)

Both true, but as with the other scenarios (except TREC), there is a
human in the loop, and we don't have to take every question available,
just 100 or so good ones. Maybe we could even use the FAQs applied
against the archive.

The other hard part about the mail archive is that you are likely to get
matches against the emails asking a question, not just the ones answering
it. I'm not sure whether those count as relevant. Sometimes, for me, just
reading how someone else phrased the problem is enough to spur an answer.
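
To make the question-extraction idea concrete, here's a rough Python
sketch of the "sentences ending in ?" approach; the mbox file name is
made up, and a real archive would also need quote and signature
stripping:

import mailbox
import re

# Crude question finder: grab spans ending in '?'. Quoted text and nested
# punctuation will add noise, but as noted above a human pass over the
# output is assumed -- we only need ~100 good ones.
QUESTION_RE = re.compile(r'[^.!?\n][^.!?]*\?')

def candidate_queries(mbox_path, limit=100):
    """Collect candidate query strings from subjects/bodies of an mbox."""
    queries = []
    for msg in mailbox.mbox(mbox_path):
        payload = msg.get_payload(decode=True) or b""  # None for multipart
        text = (msg.get("Subject") or "") + "\n" + payload.decode(
            "utf-8", errors="replace")
        for match in QUESTION_RE.findall(text):
            q = " ".join(match.split())   # collapse wrapped-line whitespace
            if 15 < len(q) < 200:         # drop fragments and long rambles
                queries.append(q)
        if len(queries) >= limit:
            break
    return queries[:limit]

# Usage (hypothetical archive file):
#   print(candidate_queries("java-user-200711.mbox")[:10])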



: Of course, we could see if there is a way to purchase the TREC data
: (donations, anyone?) and make it available to committers on zones. This is

If spending money is an option, but spending enough money for TREC isn't,
something I've been considering is using Amazon's Mechanical Turk to
generate judgements ... take some seed data (i.e., referrer-log query
strings and the title/summary/url of the top 5 URLs for each) and give
mturk users $0.05 to rank those 5 in order of how well they match.
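
For concreteness, a rough Python sketch of just the seed-data side --
writing one CSV row per query with its top-5 hits, which could feed an
MTurk ranking template. The top5 helper and the field layout are
hypothetical; none of this is a real MTurk API:

import csv

def write_seed_csv(queries, top5, out_path="mturk_seed.csv"):
    """Write one HIT row per query: the query plus its top-5 hits.

    top5(query) is an assumed helper returning five (title, summary, url)
    tuples, e.g. scraped from the results pages behind the referrer logs.
    """
    fields = ["query"] + [f"{name}{i}" for i in range(1, 6)
                          for name in ("title", "summary", "url")]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for query in queries:
            row = {"query": query}
            for i, (title, summary, url) in enumerate(top5(query), start=1):
                row[f"title{i}"] = title
                row[f"summary{i}"] = summary
                row[f"url{i}"] = url
            writer.writerow(row)

MTurk would then collect the workers' rankings of the five, giving cheap
(if noisy) relevance judgements.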



I believe the TREC collection costs somewhere around $300, so it isn't
going to break the bank. Perhaps we could ask the board to pay for it, or
maybe we could arrange donations; I'd be willing to kick in up to $50 to
have it available. Still, I don't like this route, since only committers
would have access (because it would live on zones), and I don't know that
this is a high priority for committers. Instead, I want something
researchers and upstart grad students can easily download and try out, so
that we can all discuss it because we all have the data. Furthermore, by
having multiple data sets, we can hopefully avoid the over-tuning
problem.

-Grant
