[
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719318#action_12719318
]
Shai Erera commented on LUCENE-1595:
------------------------------------
Well ... it depends on what the "existing algorithms out there" are. .alg files
that someone wrote which use existing DocMakers (from benchmark) would break,
but fixing them is a no-brainer (just reference the ContentSource where
applicable). .alg files which use custom DocMakers are a bit more challenging,
since you'll need to decide if your DocMaker is a ContentSource, really a
DocMaker, or both (i.e., split it into a DocMaker and a ContentSource). Since I
haven't changed the API of DocMaker much, it shouldn't be a hard task to
refactor your custom DocMaker.
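To illustrate the "both" case, here is a minimal sketch of splitting one custom class into a reading half and a document-building half. All names in it (RawDoc, RawSource, ListSource, FieldMaker) are hypothetical stand-ins, not the benchmark API or the classes in the patch, and a Map stands in for a Lucene Document so the sketch has no dependencies.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the raw data a source hands out (not a Lucene class).
final class RawDoc {
    final String title;
    final String body;
    RawDoc(String title, String body) { this.title = title; this.body = body; }
}

// Reading side: knows only how to fetch the next raw document.
interface RawSource {
    RawDoc next(); // returns null when the collection is exhausted
}

// The reading half of the old custom class moves here.
final class ListSource implements RawSource {
    private final Iterator<RawDoc> it;
    ListSource(List<RawDoc> docs) { this.it = docs.iterator(); }
    public RawDoc next() { return it.hasNext() ? it.next() : null; }
}

// Building side: knows only how to turn raw data into fields.
// A Map stands in for a Lucene Document (assumption made for this sketch).
class FieldMaker {
    private final RawSource source;
    FieldMaker(RawSource source) { this.source = source; }

    Map<String, String> makeDocument() {
        RawDoc raw = source.next();
        if (raw == null) return null; // source exhausted
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", raw.title);
        doc.put("body", raw.body);
        return doc;
    }
}

class SplitSketch {
    public static void main(String[] args) {
        RawSource source = new ListSource(Arrays.asList(
                new RawDoc("first", "hello"), new RawDoc("second", "world")));
        FieldMaker maker = new FieldMaker(source);
        Map<String, String> doc;
        while ((doc = maker.makeDocument()) != null) {
            System.out.println(doc);
        }
    }
}
```

With this shape, a custom class that only changed how data is read would become a new RawSource, while one that only changed which fields are created would extend FieldMaker.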
In general, I believe benchmark is not used in production environments, and
therefore it shouldn't be a real problem to adapt your benchmark .alg and/or
custom classes to the refactored one. You can also gain by extending DocMaker
and using its DocState for reuse logic. If we say that contrib in general
does not need to maintain back-compat, and we're talking about classes that are
not in production environments, then I don't think we have a real issue here.
I won't ask how much we believe benchmark is extended (even though it's
important) or used, since this issue originated from my own extension of it. I
can only assume that Solr extends it too (or uses it).
> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
> Key: LUCENE-1595
> URL: https://issues.apache.org/jira/browse/LUCENE-1595
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Shai Erera
> Assignee: Mark Miller
> Fix For: 2.9
>
> Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today,
> DocMaker has two roles: collecting documents from a collection and preparing
> a Document object. These two should actually be split into ContentSource and
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/
> 1591, by having a basic ContentSource that offers input stream services, and
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the
> same whether I create documents using DocState, add payloads or index
> additional metadata. Same goes for Trec and Reuters collections, as well as
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are
> 99% the same and 99% different. Most of their differences lie in the way they
> read the data, while most of the similarity lies in the way they create
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker
> (just for the reuse of DocState). Also, other DocMakers do not use that
> DocState today, something they could have gotten for free with the proposed
> refactoring.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC,
> Line, Simple), I can write several DocMakers, such as DocStateMaker,
> ConfigurableDocMaker (one which accepts all kinds of config options) and
> custom DocMakers (payload, facets, sorting), pass them a ContentSource
> instance, and reuse the same DocMaking algorithm with many content sources, as
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package
> for my application's purposes) and I like the flexibility I have. I think
> this can be a nice contribution to the benchmark package, which can result in
> some code cleanup as well.
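The point in the description about perf testing content sources alone can be sketched roughly as follows. This is only an illustration under assumptions: LineSource and SourceOnlyBench are hypothetical names, not classes from the patch, the data is generated in memory, and only gzip is compared against a plain stream because the JDK ships no bzip2 stream. The timings are naive single-shot wall-clock numbers, not a proper benchmark.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical line-oriented source: one document per line, read from any InputStream.
final class LineSource implements Closeable {
    private final BufferedReader reader;
    private long bytesRead; // raw size tracking: UTF-8 bytes of each line, newline excluded

    LineSource(InputStream in) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    String nextLine() throws IOException {
        String line = reader.readLine();
        if (line != null) bytesRead += line.getBytes(StandardCharsets.UTF_8).length;
        return line;
    }

    long bytesRead() { return bytesRead; }

    public void close() throws IOException { reader.close(); }
}

final class SourceOnlyBench {
    // Drains the source without building any document; returns the number of lines read.
    static int drain(LineSource source) throws IOException {
        int count = 0;
        while (source.nextLine() != null) count++;
        return count;
    }

    // Compresses a byte array with gzip, entirely in memory.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) sb.append("doc ").append(i).append(" some body text\n");
        byte[] plain = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(plain);

        long t0 = System.nanoTime();
        int plainDocs;
        try (LineSource s = new LineSource(new ByteArrayInputStream(plain))) {
            plainDocs = drain(s);
        }
        long t1 = System.nanoTime();
        int gzipDocs;
        try (LineSource s = new LineSource(
                new GZIPInputStream(new ByteArrayInputStream(packed)))) {
            gzipDocs = drain(s);
        }
        long t2 = System.nanoTime();

        System.out.println("plain: " + plainDocs + " docs in " + (t1 - t0) / 1000 + " us");
        System.out.println("gzip:  " + gzipDocs + " docs in " + (t2 - t1) / 1000 + " us");
    }
}
```

Because the source takes any InputStream, the same drain loop measures each stream type without any document-creation overhead, which is the separation the refactoring is after.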