If you look at people.apache.org:~bimargulies/dpf-bench.log (http://people.apache.org/bimargulies/dpf-bench.log should also work), you'll see the results of a luceneutil run that compares DPF to 'normal' on the 10M Wikipedia case. Some things are better, some are worse, some are the same.
The claim here was never that DPF was some sort of universal solvent; it was that for certain applications it made a material speedup, and so it was worth some API complexity to liberate it from the codecs. I'm going to assert here that these results support the claim well enough to justify taking a run at the API, and then we'll see if I can come up with something that people find tolerable in proportion to the benefit.

On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies <[email protected]> wrote:
> On Thu, Apr 3, 2014 at 11:37 AM, Michael McCandless
> <[email protected]> wrote:
>> Is the benchmark just trying to measure speedups by using DirectPF vs
>> the default PF? You could do this today w/ luceneutil (using
>> Wikipedia as content).
>>
>> But if you have another content source / index, I'm happy to run the
>> benchmark. It'd be easier to make the content available (CSV, or line
>> docs file format), then ship around big indices ...
>>
>> I have a box with 48 GB RAM.
>>
>> Mike McCandless
>
> My takeaway from the prior conversation was that various people didn't
> entirely believe that I'd seen a dramatic improvement in query perfo
> using D-P-F, and so would not smile upon a patch intended to liberate
> D-P-F from codecs. It could be that the effect I saw has to do with
> the fact that our system depends on hitting and scoring 50% of the
> documents in an index with a lot of documents.
>
> If you can help me try to simulate this situation with luceneutil, I'd
> be happy to skip the work I was about to do to build another
> benchmark.
>
>
>
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Apr 3, 2014 at 8:38 AM, Benson Margulies <[email protected]>
>> wrote:
>>> Some of you may recall that I started a thread some time ago about
>>> wishing for the benefits of the direct posting format without needing
>>> to use a codec. The thread landed as a challenge: show a benchmark of
>>> the benefit of D-P-F.
>>>
>>> After a lot of distraction, I'm now in a position to build it. The
>>> core is a rather large index, and to show the effect (always assuming
>>> that I succeed) will take a machine with a large amount of RAM.
>>>
>>> One approach is for me to simply build the index involved and make it
>>> available as an index. Another would be to side-step into a giant pile
>>> of CSV or JSON and provide a do-it-yourself kit.
>>>
>>> Anyone have a preference?
>>>
>>> What have we got for hardware with, 40G of RAM? Anything, or will this
>>> be up to individuals to try out on dayjob hardware?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
