Shai, OK. I couldn't tell whether it was tied to round iterations (it looked that way). I left the default as it was, for back compat, but now there's an obscure option, content.source.excludeIteration, that you can use to disable it.
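
To make the effect of the option concrete, here is a minimal sketch. It is not the actual TrecContentSource code: the class DocNameSketch, the method buildDocName, and the use of java.util.Properties as a stand-in for the benchmark config are assumptions for illustration only. The option name and the name + "_" + iteration concatenation are the only parts taken from this thread.

```java
import java.util.Properties;

// Hypothetical stand-in for the naming logic; not the real TrecContentSource.
class DocNameSketch {
    // Option name as given in this thread.
    static final String EXCLUDE_ITERATION = "content.source.excludeIteration";

    // Returns the doc name, with "_<iteration>" appended unless the option is set to true.
    static String buildDocName(String docno, int iteration, Properties config) {
        // Defaults to false, so the suffix is kept unless explicitly disabled (back compat).
        boolean exclude = Boolean.parseBoolean(config.getProperty(EXCLUDE_ITERATION, "false"));
        return exclude ? docno : docno + "_" + iteration;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        System.out.println(buildDocName("DOCID0001", 0, defaults));    // DOCID0001_0

        Properties noIteration = new Properties();
        noIteration.setProperty(EXCLUDE_ITERATION, "true");
        System.out.println(buildDocName("DOCID0001", 0, noIteration)); // DOCID0001
    }
}
```

With the option left unset, DOCID0001 still comes out as DOCID0001_0, as before. With it set to true, the name stays DOCID0001. In an .alg file this would presumably be a line such as content.source.excludeIteration=true.
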
I used content.source.* because I saw that the other ContentSources (reuters, etc.) seemed to do a similar thing, although I didn't implement any code to respect it there.

On Thu, Nov 12, 2009 at 10:54 PM, Shai Erera <ser...@gmail.com> wrote:

> I think the fix you've made makes sense. The iteration number is added in
> case you want to collect more than avail documents (such that it starts over
> with the first one). I don't think it has to do with the iterations option
> in Benchmark, although it could.
>
> Being able to configure it makes sense to me. What's the default? I
> personally don't mind if it would be without iterations ...
>
> BTW, we could decide not to allow configuring it, and only if there is a
> second iteration, the code would add _<iter> to the names. So that names
> would be DOCID0001 and in the second iteration DOCID0001_0 (or _1).
>
> Shai
>
>
> On Thu, Nov 12, 2009 at 8:53 PM, Robert Muir <rcm...@gmail.com> wrote:
>
>> If I use TrecContentSource to index a collection, it puts the doc name
>> into the docname field, just as I like.
>> say i have a doc with
>> <DOCNO>DOCID0001</DOCNO>
>> the problem is that concatenates the iteration number to this document
>> name:
>>
>> name = name + "_" + iteration;
>>
>> this produces a docname of DOCID0001_0, which won't work if I am trying to
>> use the quality package to measure relevance.
>>
>> Does anyone object to changing TrecContentSource to *not do this* ???
>> I would think the primary reason you would want to use it would be to
>> measure relevance.
>>
>> alternatively, we could change DocNameExtractor in the quality package to
>> ignore this _Iteration suffix... doesn't matter to me.
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>
>

--
Robert Muir
rcm...@gmail.com