The problem as I see it is that if you want to implement a "true" Tf-Idf
similarity, e.g. as specified in the books, you have no way to do that by
extending TfIdfSimilarity, which is odd.

I think that we can make TfIdfSim implement the core parts of Tf-Idf,
letting extensions worry about the details. Whoever wants to change a
single marginal thing should extend DefaultSim. If they want finer
control, e.g. over norms, queryNorm etc., they can extend TfIdfSim. And if
they want to implement something else completely, well, they can extend
Similarity.
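
A rough sketch of that layering, to make the idea concrete. Everything in it
is hypothetical: the Sketch* class names, the simplified tf * idf * norm
score, and the round(norm * 255) one-byte encoding are stand-ins chosen for
illustration, not Lucene's actual API or its real norm encoding.

```java
// Hypothetical sketch only: these classes are NOT Lucene's real API.

// Top layer: anything goes, for people implementing something else entirely.
abstract class SketchSimilarity {
    abstract float score(int freq, long docFreq, long numDocs, long storedNorm);
}

// Middle layer: fixes the core tf-idf shape, leaves the details open,
// including how norms are encoded and decoded.
abstract class SketchTfIdfSimilarity extends SketchSimilarity {
    abstract float tf(int freq);
    abstract float idf(long docFreq, long numDocs);
    abstract long encodeNorm(float norm);
    abstract float decodeNorm(long stored);

    // Shorter fields get a higher norm; always in (0, 1] for numTerms >= 1.
    float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Simplified score (assumption): tf * idf * norm, no queryNorm or boosts.
    @Override
    final float score(int freq, long docFreq, long numDocs, long storedNorm) {
        return tf(freq) * idf(docFreq, numDocs) * decodeNorm(storedNorm);
    }
}

// Bottom layer: sensible defaults, for people tweaking one marginal thing.
class SketchDefaultSimilarity extends SketchTfIdfSimilarity {
    @Override float tf(int freq) { return (float) Math.sqrt(freq); }

    @Override float idf(long docFreq, long numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }

    // Stand-in lossy encoding: quantize a norm in [0, 1] to 0..255.
    @Override long encodeNorm(float norm) { return Math.round(norm * 255f); }
    @Override float decodeNorm(long stored) { return stored / 255f; }
}

// Someone who does not want to lose precision overrides only the norm codec.
class PreciseNormSimilarity extends SketchDefaultSimilarity {
    @Override long encodeNorm(float norm) { return Float.floatToIntBits(norm); }
    @Override float decodeNorm(long stored) {
        return Float.intBitsToFloat((int) stored);
    }
}

class NormPrecisionDemo {
    public static void main(String[] args) {
        SketchTfIdfSimilarity lossy = new SketchDefaultSimilarity();
        SketchTfIdfSimilarity precise = new PreciseNormSimilarity();
        float norm = lossy.lengthNorm(1000);
        System.out.println("original norm:  " + norm);
        System.out.println("one-byte codec: " + lossy.decodeNorm(lossy.encodeNorm(norm)));
        System.out.println("full-precision: " + precise.decodeNorm(precise.encodeNorm(norm)));
    }
}
```

With a split like this, the precision-preserving variant overrides two
methods instead of copying the scoring code.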

Right now, I need to copy most of the "tf-idf" code into my Sim, and I
don't think that's good software engineering. How many people really extend
Tf-Idf, such that a more complicated API would be a concern?

Shai


On Tue, Jun 25, 2013 at 4:11 PM, Robert Muir <[email protected]> wrote:

>
>
> On Tue, Jun 25, 2013 at 8:54 AM, Shai Erera <[email protected]> wrote:
>
>> Hi
>>
>> Today TfIdfSimilarity forces the encoding of norms into a single byte,
>> and there's no way to override it. E.g. if I don't want to lose precision,
>> the only thing I can do is write a different Similarity while copying most
>> of the code from TfIdfSimilarity.
>>
>
> But as you said, it's expert enough :)
>
> I'm a little worried about how complex this would make the API. Today
> TFIDFSimilarity hides all of this stuff and only provides a simple API with
> tf(), idf(), etc. for tuning. I think that's really how they all should
> work...
>
