Re: TrecContentSource and docname/iteration number

Robert Muir Thu, 12 Nov 2009 20:11:23 -0800

Shai, OK. I couldn't tell whether it was tied to the round iterations (it looked that way).

I left the default as it was, for back compat, but now there's an obscure
option, content.source.excludeIteration, that you can use to disable the
iteration suffix.


I used the content.source.* prefix because I saw that the other ContentSources
(reuters, etc.) seemed to do a similar thing, although I didn't implement any
code to respect it there.
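
For anyone following along, here is a minimal sketch of how such an option
could gate the suffix. It is not the committed patch: the DocNameSketch class
and the buildName method are made-up names, and java.util.Properties is only a
stand-in for the benchmark configuration object. Only the property name and the
"_" + iteration concatenation come from this thread.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not the actual TrecContentSource code.
class DocNameSketch {
    // Property name as given in this thread. Defaulting to false keeps the old behaviour.
    static final String EXCLUDE_ITERATION = "content.source.excludeIteration";

    // Returns the docname, with "_<iteration>" appended unless the option is set to true.
    static String buildName(String docNo, int iteration, Properties config) {
        boolean exclude =
            Boolean.parseBoolean(config.getProperty(EXCLUDE_ITERATION, "false"));
        return exclude ? docNo : docNo + "_" + iteration;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        System.out.println(buildName("DOCID0001", 0, config)); // DOCID0001_0
        config.setProperty(EXCLUDE_ITERATION, "true");
        System.out.println(buildName("DOCID0001", 0, config)); // DOCID0001
    }
}
```

With the option left unset the name still carries the suffix, so existing
setups see no change; setting it to true yields the bare DOCNO that the
quality package expects.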

On Thu, Nov 12, 2009 at 10:54 PM, Shai Erera <ser...@gmail.com> wrote:

> I think the fix you've made makes sense. The iteration number is added in
> case you want to collect more documents than are available (so that it
> starts over with the first one). I don't think it has to do with the
> iterations option in Benchmark, although it could.
>
> Being able to configure it makes sense to me. What's the default? I
> personally don't mind if it would be without iterations ...
>
> BTW, we could decide not to allow configuring it, and only if there is a
> second iteration, the code would add _<iter> to the names. So that names
> would be DOCID0001 and in the second iteration DOCID0001_0 (or _1).
>
> Shai
>
>
> On Thu, Nov 12, 2009 at 8:53 PM, Robert Muir <rcm...@gmail.com> wrote:
>
>> If I use TrecContentSource to index a collection, it puts the doc name
>> into the docname field, just as I like.
>> Say I have a doc with
>> <DOCNO>DOCID0001</DOCNO>
>> The problem is that it concatenates the iteration number to this document
>> name:
>>
>> name = name + "_" + iteration;
>>
>> this produces a docname of DOCID0001_0, which won't work if I am trying to
>> use the quality package to measure relevance.
>>
>> Does anyone object to changing TrecContentSource to *not do this* ???
>> I would think the primary reason you would want to use it would be to
>> measure relevance.
>>
>> alternatively, we could change DocNameExtractor in the quality package to
>> ignore this _Iteration suffix... doesn't matter to me.
>>  --
>> Robert Muir
>> rcm...@gmail.com
>>
>
>
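
The two alternatives mentioned in the quoted messages could look roughly like
this. Again, this is a sketch only: SuffixAlternatives, nameForPass and
stripIteration are hypothetical names, using _1 for the second pass is just
one reading of Shai's suggestion, and the regex is an assumption about the
suffix shape, not something DocNameExtractor does today.

```java
// Hypothetical sketch of the two alternatives; neither is existing Lucene code.
class SuffixAlternatives {
    // Shai's idea: plain name on the first pass, "_<iter>" only once the source starts over.
    static String nameForPass(String docNo, int iteration) {
        return iteration == 0 ? docNo : docNo + "_" + iteration;
    }

    // The other idea: strip a trailing "_<digits>" before matching against the judgments.
    static String stripIteration(String docName) {
        return docName.replaceFirst("_\\d+$", "");
    }

    public static void main(String[] args) {
        System.out.println(nameForPass("DOCID0001", 0));   // DOCID0001
        System.out.println(nameForPass("DOCID0001", 1));   // DOCID0001_1
        System.out.println(stripIteration("DOCID0001_0")); // DOCID0001
    }
}
```

Note that stripping is lossy: a collection whose DOCNOs legitimately end in an
underscore followed by digits would be mangled, which is one reason to prefer
not adding the suffix in the first place.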


-- 
Robert Muir
rcm...@gmail.com
