Re: jBASE unefficient? - distributed files

Jim Idle Sun, 22 Mar 2009 16:57:44 -0700

Pawel (privately) wrote:

Hi,

The problem isn't so much that distributed files are inefficient but
probably that the algorithm you are using is not optimal. What is the
SELECT statement you are using?

These are regular SELECT statements, e.g. SELECT <distributed_file> WITH
FIELD1 LIKE something...


I will focus on one specific file only, but the others are very similar.
Our distribution algorithms are always uncomplicated, and in the case
discussed here they distribute the data very well.

Or at least, you THINK they do ;-)

The distribution algorithm is very simple: it uses part of the date (the day)
contained in the key to distribute records, so we have 32 partfiles. IDs not
matching a given pattern (say 1A5N) are put into partfile 32. For those
matching the pattern there is only one invocation of ICONV and OCONV.

That's not very efficient for record key calculations I am afraid, even though conversions are very fast (relatively) on jBASE 4.1, and then you have to call the subroutine in the first place of course.

The day is
obtained from the date and returned as the partfile number. The procedure
could not be simpler (a few lines), I think :)
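
A minimal sketch of a routine of this shape, written in Python purely for illustration (the real routine is a jBASE subroutine and is not shown in this thread). The key layout is an assumption: one letter followed by five digits, read as an internal date counted in days from 31 December 1967. The names part_number, KEY_PATTERN and OVERFLOW_PART are hypothetical.

```python
import re
from datetime import date, timedelta

# Assumption: the five digits are an internal date, with day 0 = 31 Dec 1967.
INTERNAL_DATE_EPOCH = date(1967, 12, 31)
# Assumption: "1A5N" means exactly one letter followed by five digits.
KEY_PATTERN = re.compile(r"[A-Za-z][0-9]{5}")
# Keys that do not match the pattern go to the last partfile.
OVERFLOW_PART = 32


def part_number(key: str) -> int:
    """Return the partfile number (1..32) for a record key."""
    if KEY_PATTERN.fullmatch(key) is None:
        return OVERFLOW_PART
    internal_day = int(key[1:])
    # Day of month (1..31) of the date embedded in the key.
    return (INTERNAL_DATE_EPOCH + timedelta(days=internal_day)).day


if __name__ == "__main__":
    for key in ("A00001", "A00031", "A00032", "NOT-A-DATE-KEY"):
        print(key, "->", part_number(key))
```

Even a routine this small is invoked once per key, so its cost is multiplied by the number of records touched.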

The performance problem arises when you ask for data with selection
criteria. jBASE will start to call the distribution subroutine thousands of
times.

Well, obviously, it must. It does not know how you are calculating the item ids and yet it must traverse every item in order to test the criteria.

This will introduce enormous overhead. We usually do not need to
run queries like that, but for some (CSHD) investigations we are forced
to do so.

If you create a list then read through
that list, you will read in the list order, which may not be optimal for
that distribution. Note that many moons ago I modified the distributed
files so that you could change the key on the fly to guarantee that the
distribution is good (not that anyone has ever used it except the people I
wrote it for). Otherwise the key order you get is not necessarily the
order that is best to read through the part files and you will create
millions of random reads instead of lots of sequential reads.
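
One way to picture the read-order point is to regroup a select list by part before reading, so that each part file is visited once instead of being revisited at random. The Python sketch below only illustrates that idea; it is not how jBASE orders lists internally. The function toy_part is a hypothetical stand-in for a real distribution routine, and the keys are made up.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_part(keys: Iterable[str],
                  part_of: Callable[[str], int]) -> Dict[int, List[str]]:
    """Group keys by part number, keeping list order inside each part."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[part_of(key)].append(key)
    # Return the parts in ascending order so each one is read in a single pass.
    return {part: groups[part] for part in sorted(groups)}


def toy_part(key: str) -> int:
    """Hypothetical distribution rule: the last digit of the key."""
    return int(key[-1])


select_list = ["K103", "K201", "K302", "K101", "K203"]
read_plan = group_by_part(select_list, toy_part)

if __name__ == "__main__":
    for part, part_keys in read_plan.items():
        print(part, part_keys)
```

Read in list order, these five keys touch parts 3, 1, 2, 1, 3; read in the grouped order, each part is touched once.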

I think that the select is taking keys "in natural order" from the partfiles,
but I can confirm tomorrow. We are using jBASE 4.1.5.17.

Well, if you get the same order for the file as you do if you LIST each individual part file, then you could be.

The main difference is that jBASE runs the distribution routine for these
"full scan" selects, and I cannot understand why it needs to do that.

How can it do otherwise? The list must be the list of record keys.

I guess that the SELECT / READNEXT operations of the jEDI driver implemented
for distributed files are handling distribution virtually (so the SELECT
program is not aware of partfiles),

Yep. And because it is a calculated key, it probably isn't using the fastscan interface so performance will be very low in comparison.

but just performs a SELECT / READNEXT plus a READ
of each record.

This is inefficient, because the READ introduces unnecessary overhead caused
by calling the distribution routine. Results can be obtained much faster by
doing (direct) SELECTs on the partfiles and combining the output.

Yes - that is what I said you should be doing. Then there are specific routines you can use to merge lists. Or you can wait for my new file system and not bother with the distribution as you won't need it ;-)
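
As a rough illustration of the select-each-part-then-merge approach, here is a Python sketch that models each part file as an in-memory dictionary. It is not jBASE code and does not use the jBASE list-merging routines mentioned above; select_part, select_all_parts and the sample data are all hypothetical.

```python
import heapq
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
PartFile = Dict[str, Record]  # record key -> record


def select_part(part: PartFile,
                predicate: Callable[[Record], bool]) -> List[str]:
    """Scan one part directly; no distribution routine is involved."""
    return sorted(key for key, record in part.items() if predicate(record))


def select_all_parts(parts: Iterable[PartFile],
                     predicate: Callable[[Record], bool]) -> List[str]:
    """Select from every part, then merge the sorted per-part lists."""
    per_part_lists = [select_part(part, predicate) for part in parts]
    return list(heapq.merge(*per_part_lists))


parts = [
    {"A1": {"FIELD1": "abc"}, "A3": {"FIELD1": "xyz"}},
    {"A2": {"FIELD1": "abd"}, "A4": {"FIELD1": "abe"}},
]
matches = select_all_parts(parts, lambda record: record["FIELD1"].startswith("ab"))

if __name__ == "__main__":
    print(matches)
```

The point of the design is that each part is scanned as an ordinary file, and the only cross-part work is the final merge of the key lists.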

This is, however, an optimization for the jBASE team,

No - we optimized for the general case, but if you are going to take over the key (or rather partition selection), there is nothing to be done but ask you for it.

not us, I believe. We already
raised it, but I noticed "resistance" to accepting this ticket :(

I think the reason is probably that the ticket isn't correct. This is what you get from the partitions as it stands. You should probably raise a ticket that steps back and asks for advice on choosing

A select will read the ID, then it will read the record -
non-distributed files use some neat tricks for bulk reads, whereas
distributed files probably cannot. Have you considered SELECTing each
part file individually, then merging the results? I doubt that the
distributed files can be generically optimized, but by changing the key
on read and write (computationally of course) you can probably get much
better performance.

Of course, once you can use my new file system, you won't need
distributed files and won't have this problem :-)

I need to read your post from the past. Do you know if jBASE 5 would
help us eliminate the problem described?

I think that calling the distribution routine is not needed if you do a
full table scan. I guess that many people could benefit from such an
optimization.

Kind regards
Pawel


