Hui, thanks for running the numbers and providing this to the list!
The results are very interesting. Of course, the primary purpose of the compound file format is to reduce the number of file handles used (at the expense of some performance, especially during indexing). I put your data into Excel to try to summarize it. Here's what I found:
Indexing
  Traditional, min merge docs 10:  1458110
  Compound,    min merge docs 10:  1619580 (+11%)
  Traditional, min merge docs 100: 1321379 (-9%)
I wonder what the compound at 100 would have looked like?
Searching
Indexes  Trad 10  Comp 10  Trad 100
0        37.8     46.8     40.6
1        12.4     15.6     12.6
2        28.4     31.4     37.6
3        18.6     21.8     31.2
4        31.0     28.4     28.4
5        21.8     22.0     34.4
6        31.0     25.2     31.4
7        37.4     28.6     34.4
8        37.2     34.4     34.2
9        37.4     34.4     34.8
10       40.6     50.4     40.8
11       56.2     40.6     47.0
12       59.6     43.4
13       43.8     46.8
14       40.6     34.6
15       47.0     37.4
16       37.4
17       50.2
18       37.4
19       41.0
I'm also attaching a graph of this data, but I'm not sure if it will come through on the mailing list. I think there is too much variation in the data to say anything interesting. The numbers above are averages across all of your queries (regardless of how many hits they brought back) for a given number of "open indexes"; I'm assuming this is the number of indexes opened by the merged searcher. I think most of the irregularity comes from occasional GC activity.
Perhaps if you could run with -verbose:gc option and then subtract the time in GC from the search times, we might get a more stable set of results. That would show the true difference of the code speed. You could also then graph the amount of memory collected by the GC or the time spent in GC to see if compound files cause more or less GC.
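To make the idea concrete, here is a minimal sketch of the subtraction I have in mind. The function name and all the numbers are mine (illustrative only, not from Hui's runs); the point is just that removing the GC pause time attributed to each run leaves an estimate of pure code speed.

```python
# Hypothetical sketch: subtract per-run GC pause time from measured search
# times to estimate pure code speed. All numbers here are illustrative.
def subtract_gc(search_ms, gc_ms):
    """Return search times (ms) with overlapping GC pause time removed."""
    return [round(s - g, 1) for s, g in zip(search_ms, gc_ms)]

raw = [46.8, 50.4, 40.6]  # measured average search times (ms)
gc = [8.0, 12.0, 0.0]     # GC pause time attributed to each run (ms)
print(subtract_gc(raw, gc))  # [38.8, 38.4, 40.6]
```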
Could you please tell the list what VM, hardware, and OS were used in this test? Also, was there just one thread searching in all these cases? I've found that these types of tests vary widely across OSes and hardware. In my experience, Solaris/Sparc/SCSI gives slower responses but is much more consistent in how it scales as the number of concurrent searches increases, whereas Windows/Intel/IDE can be dramatically faster at lower workloads but becomes erratic as the number of concurrent searches goes up.
I'm also going to attach a Python script that can help analyze the GC output of Sun's 1.3 VMs. One problem with those VMs is that they do not print when a given GC occurred, which makes the output hard to analyze. It is possible, however, to have a thread that sleeps for 1 second and then prints "[TS 234234234]" to the System.out stream, where the number is the value of System.currentTimeMillis(). The script I'm including will parse that kind of data and produce comma-separated GC stats that can be graphed more easily.
Hope you guys find the above useful. Good luck! Dmitry.
#!/usr/local/bin/python
# Parse Sun 1.3 -verbose:gc output (with "[TS nnn]" timestamps injected by
# the application) and emit comma-separated GC statistics per timeslice.
import sys

text = open(sys.argv[1], "r").read()
pos = 0
end = len(text)
timestamp = 0
gcCount = 0      # number of quick GCs in this timeslice
fullgcCount = 0  # number of full GCs in this timeslice
kbCount = 0      # kb memory collected so far in this timeslice
kbUsed = 0       # most recent amount of kb used in VM
kbLastUsed = 0   # used memory figure at the end of the prior timeslice
kbCollected = 0  # total memory collected from start of log
secsCount = 0    # seconds spent in GC in this timeslice

print "stamp,gc,fullgc,kb,secs,time,gc/sec,fullgc/sec,kb/sec,gcsecs/sec,kbUsed,kbTotal,kbUsed/sec"

fieldSeps = ('K->', 'K(', 'K), ', ' secs')

while pos > -1 and pos < end:
    if text.startswith("[TS ", pos):
        # timestamp entry
        ts1 = pos + 4
        ts2 = text.find("]", ts1, ts1 + 50)
        if ts2 > 0:
            t = long(text[ts1:ts2])
            if timestamp > 0:
                length = (t - timestamp) / 1000.0
                print "%i,%i,%i,%i,%f,%f,%f,%f,%f,%f,%i,%i,%f" % (
                    timestamp, gcCount, fullgcCount, kbCount,
                    secsCount,
                    length,
                    gcCount / length,
                    fullgcCount / length,
                    kbCount / length,
                    secsCount / length,
                    kbUsed,
                    kbUsed + kbCollected,
                    (kbUsed - kbLastUsed + kbCount) / length)
            timestamp = t
            gcCount = 0
            fullgcCount = 0
            kbCount = 0
            kbLastUsed = kbUsed
            secsCount = 0
            pos = ts2 + 1
    elif text.startswith("[GC ", pos):
        # plain GC entry
        pos += 4
        p2 = text.find("]", pos, pos + 50)
        if p2 > 0:  # was "if end > 0", which was always true
            gcCount += 1
            tmp = text[pos:p2]
            values = []
            for fs in fieldSeps:
                sp = tmp.split(fs)
                values.append(sp[0])
                tmp = sp[1]
            # values: used before GC, used after GC, total heap, pause secs
            collected = long(values[0]) - long(values[1])
            kbCount += collected
            kbCollected += collected  # was "+= kbCount", which double-counted
            kbUsed = long(values[1])
            secsCount += float(values[3])
            pos = p2 + 1
    elif text.startswith("[Full GC ", pos):
        # full GC entry -- same fields as a plain GC entry
        pos += 9
        p2 = text.find("]", pos, pos + 50)
        if p2 > 0:
            fullgcCount += 1
            tmp = text[pos:p2]
            values = []
            for fs in fieldSeps:
                sp = tmp.split(fs)
                values.append(sp[0])
                tmp = sp[1]
            collected = long(values[0]) - long(values[1])
            kbCount += collected
            kbCollected += collected
            kbUsed = long(values[1])
            secsCount += float(values[3])
            pos = p2 + 1
    pos = text.find("[", pos + 1)
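For reference, this is how the fieldSeps split walks through a single -verbose:gc line. The sample line and its numbers are made up, but they follow the Sun 1.3 format the script expects:

```python
# Walk one hypothetical Sun 1.3 -verbose:gc line through the same split
# logic the script uses. The numbers in the sample line are made up.
line = "[GC 2048K->1024K(4096K), 0.0032 secs]"
fieldSeps = ('K->', 'K(', 'K), ', ' secs')

tmp = line[4:line.find("]")]  # strip the "[GC " prefix and trailing "]"
values = []
for fs in fieldSeps:
    sp = tmp.split(fs)
    values.append(sp[0])  # text before this separator
    tmp = sp[1]           # keep scanning the remainder

# values: used before GC, used after GC, total heap, pause seconds
print(values)  # ['2048', '1024', '4096', '0.0032']
print(int(values[0]) - int(values[1]))  # kb collected: 1024
```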
