[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

Shai Erera (JIRA) Sun, 14 Jun 2009 20:32:32 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719379#action_12719379 ]


Shai Erera commented on LUCENE-1595:
------------------------------------

bq. So its not like I'm sitting here saying this is a huge deal - but I think 
it should def be considered a bit.

I didn't mean to imply that. Back-compat has played a major role in all my 
recent contributions. I was only explaining why I didn't give it a second 
thought in this case, and why I was actually happy to have more freedom to make 
these changes. But I definitely see your point, and would love to get another 
committer's opinion too.

{quote}
What about these changes? Are they incompat as well?

-doc.add.log.step=500
-doc.delete.log.step=100
+log.step=500
+delete.log.step=100
{quote}

I took another step here: I added code to PerfTask.tearDown() that logs 
messages, and changed all the current tasks that did their own logging to stop 
doing so and override getLogMessage() instead. That consolidates the logic for 
when to log messages, in what format, etc. Previously it was not consistent 
between tasks, and some newer tasks did a better job than others.
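
To illustrate the idea, here is a minimal sketch of that consolidation. It is 
not the code from the patch: SketchTask, SketchAddDocTask, runOnce(), the 
logged list and the getLogMessage(int) signature are all hypothetical 
stand-ins, and messages are collected in a list rather than printed so the 
behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PerfTask; NOT the actual Lucene class.
abstract class SketchTask {
    private final int logStep;   // log every logStep-th run; <= 0 disables logging
    private int count;           // number of completed runs
    // Messages are captured here instead of being printed (assumption for the demo).
    final List<String> logged = new ArrayList<>();

    SketchTask(int logStep) {
        this.logStep = logStep;
    }

    /** Subclasses do the real work here. */
    protected abstract void doLogic();

    /** Subclasses override this to supply only the message text. */
    protected String getLogMessage(int recsCount) {
        return "processed " + recsCount + " records";
    }

    /** Runs the task once, then lets tearDown() decide whether to log. */
    final void runOnce() {
        doLogic();
        tearDown();
    }

    /** The single place that decides when a message is logged. */
    protected void tearDown() {
        count++;
        if (logStep > 0 && count % logStep == 0) {
            logged.add(getLogMessage(count));
        }
    }
}

// Hypothetical task: contributes a message, but no logging logic of its own.
final class SketchAddDocTask extends SketchTask {
    SketchAddDocTask(int logStep) {
        super(logStep);
    }

    @Override
    protected void doLogic() {
        // A real task would add a document to the index here.
    }

    @Override
    protected String getLogMessage(int recsCount) {
        return "added " + recsCount + " docs";
    }
}
```

With this shape, a new task gets consistent logging for free and only decides 
what its message says.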

With that, I also changed the property name (which was inaccurate even before: 
doc.add.log.step was used by more than just AddDocTask). As for 
delete.log.step, I first removed it, relying on the new log.step, but then 
spotted some .alg files which use different logging frequencies for deletes 
and for the rest, so I reinstated it. If you think it matters, I can revert 
the name to doc.delete.log.step.
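
As a rough sketch of how the two properties could relate, the snippet below 
reads them with java.util.Properties. The LogStepConfig class, the default 
value, and the fallback from delete.log.step to log.step are assumptions made 
for illustration; the thread does not say how the benchmark resolves them.

```java
import java.util.Properties;

// Hypothetical helper; not part of the benchmark package.
final class LogStepConfig {
    // Assumed default for this sketch only; the real default is not stated in the thread.
    static final int ASSUMED_DEFAULT_STEP = 1000;

    private LogStepConfig() {
    }

    /** Step used by tasks in general. */
    static int logStep(Properties props) {
        return parse(props.getProperty("log.step"), ASSUMED_DEFAULT_STEP);
    }

    /** Step used by delete tasks; falls back to log.step when unset (assumption). */
    static int deleteLogStep(Properties props) {
        return parse(props.getProperty("delete.log.step"), logStep(props));
    }

    private static int parse(String value, int fallback) {
        return value == null ? fallback : Integer.parseInt(value.trim());
    }
}
```

An .alg file that sets log.step=500 and delete.log.step=100 would then log 
deletes five times as often as everything else.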

bq. We will get it in for 2.9

Great! That will relieve me of some of my custom benchmarking code, and allow 
me to test on more content sources (so far I have implemented this model only 
for TREC, for lack of time).

> Split DocMaker into ContentSource and DocMaker
> ----------------------------------------------
>
>                 Key: LUCENE-1595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1595
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch
>
>
> This issue proposes some refactoring to the benchmark package. Today, 
> DocMaker has two roles: collecting documents from a collection and preparing 
> a Document object. These two should actually be split into ContentSource and 
> DocMaker, which will use a ContentSource instance.
> ContentSource will implement all the methods of DocMaker, like 
> getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
> 1591, by having a basic ContentSource that offers input stream services, and 
> wraps a file (for example) with a bzip or gzip streams etc.
> DocMaker will implement the makeDocument methods, reusing DocState etc.
> The idea is that collecting the Enwiki documents, for example, should be the 
> same whether I create documents using DocState, add payloads or index 
> additional metadata. Same goes for Trec and Reuters collections, as well as 
> LineDocMaker.
> In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
> 99% the same and 99% different. Most of their differences lie in the way they 
> read the data, while most of the similarity lies in the way they create 
> documents (using DocState).
> That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
> (just for the reuse of DocState). Also, other DocMakers do not use DocState 
> today, something they could have gotten for free with the proposed 
> refactoring.
> So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
> Line, Simple), I can write several DocMakers, such as DocStateMaker, 
> ConfigurableDocMaker (one which accepts all kinds of config options) and 
> custom DocMakers (payload, facets, sorting), pass them a ContentSource 
> instance, and reuse the same DocMaking algorithm with many content sources, as 
> well as the same ContentSource algorithm with many DocMaker implementations.
> This will also give us the opportunity to perf test content sources alone 
> (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
> creating a Document object.
> I've already done so in my code environment (I extend the benchmark package 
> for my application's purposes) and I like the flexibility I have. I think 
> this can be a nice contribution to the benchmark package, which can result in 
> some code cleanup as well.
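
The split described in the issue can be sketched roughly as follows. This is 
not the API from the patch: every type here (SketchDocData, 
SketchContentSource, LineSource, SketchDocMaker) is a hypothetical stand-in, 
the tab-separated line format is an assumption, and a plain Map takes the 
place of a Lucene Document to keep the example self-contained.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical raw record handed from a content source to a doc maker.
final class SketchDocData {
    final String title;
    final String body;

    SketchDocData(String title, String body) {
        this.title = title;
        this.body = body;
    }
}

// Role 1: reading records from a collection. Knows nothing about documents.
interface SketchContentSource {
    /** Returns the next raw record, or null when the source is exhausted. */
    SketchDocData getNextDocData();
}

// One possible source: "title<TAB>body" lines (assumed format for this sketch).
final class LineSource implements SketchContentSource {
    private final Iterator<String> lines;

    LineSource(List<String> lines) {
        this.lines = lines.iterator();
    }

    @Override
    public SketchDocData getNextDocData() {
        if (!lines.hasNext()) {
            return null;
        }
        String[] parts = lines.next().split("\t", 2);
        return new SketchDocData(parts[0], parts.length > 1 ? parts[1] : "");
    }
}

// Role 2: turning raw records into documents. Works with any content source.
class SketchDocMaker {
    private final SketchContentSource source;

    SketchDocMaker(SketchContentSource source) {
        this.source = source;
    }

    /** Builds the next document, or returns null when the source is exhausted. */
    Map<String, String> makeDocument() {
        SketchDocData data = source.getNextDocData();
        if (data == null) {
            return null;
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", data.title);
        doc.put("body", data.body);
        return doc;
    }
}
```

Because the maker depends only on the interface, swapping in a different 
source, or timing a source on its own without building documents, needs no 
change to the doc-making code.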

