If the process now takes 4 hrs instead of 4 hrs 25 secs, then it is not "a lot 
faster".  Even if it's 4 mins 25 secs vs 4 mins, you've not done much (less 
than 10%).  What is the percentage improvement?  (Or, how fast is it now vs how 
fast it was?)

I would guess that you could drastically improve the performance, as you did 
not mention having done any of the SQL-related things -- which have nothing to 
do with your choice of datatypes for your collection items -- that could make 
things go (much) faster.  In my view, if you did not use a profiler to figure 
out where the bulk of the time in your code is going, you may well have 
optimized something that took only a small percentage of the total run time.  
Even if you reduced the time of some part that used to take (say) 15% of the 
total time to 0 (not likely!), you would not improve the overall timing by 
more than 15%.  I would guess that any of these would make the "insert rows in 
database" part faster:

(1) You mentioned a SQL property of your collection class.  If declared as a 
normal (instance) property, it should be coded to return a value stored in a 
class-level (static) variable.  Otherwise, you're (conceptually) creating 
22,000 copies of the same SQL statement!  (Due to CLR internal smarts, you 
might "only" be creating 22,000 pointers to the same SQL statement.  Either 
that, or each instance's SQL is different because you're embedding the data in 
each individual SQL statement -- then you have to worry about SQL injection, 
not to mention wasting a lot of time stuffing the text in.)  But you should be 
getting the SQL statement only once and using parameters (see 2), so having it 
be a static property (or a static GetSQL method) makes more sense.
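
Purely as a sketch of what I mean -- the class, table, and column names here 
are invented, not anything from your code -- it could look like this in C#:
  using System;

  // Hypothetical collection item.  The SQL text exists once, is shared by
  // every instance, and contains parameter markers rather than data (see 2).
  public class OrderRecord
  {
      public int CustomerId;
      public string CustomerName;
      public DateTime OrderDate;

      private const string InsertSql =
          "INSERT INTO Orders (CustomerId, CustomerName, OrderDate) " +
          "VALUES (@CustomerId, @CustomerName, @OrderDate)";

      // Static accessor -- no per-instance copy of the statement.
      public static string GetSQL()
      {
          return InsertSql;
      }
  }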

(2) You did not indicate that you used a prepared statement that was run once 
for each record after setting parameter values from the current in-memory 
collection item.  If you are not doing that, that technique would likely make 
things quite a bit faster (and avoid SQL injection issues).
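
For example, something along these lines (still only a sketch, reusing the 
invented OrderRecord from above; your column names and types will differ):
  using System;
  using System.Collections.Generic;
  using System.Data;
  using System.Data.SqlClient;

  static class PreparedInsertSketch
  {
      // Prepare the INSERT once, then re-run it with new parameter values
      // for each in-memory item.
      static void InsertAll(string connectionString, List<OrderRecord> items)
      {
          using (SqlConnection conn = new SqlConnection(connectionString))
          using (SqlCommand cmd = new SqlCommand(OrderRecord.GetSQL(), conn))
          {
              cmd.Parameters.Add("@CustomerId", SqlDbType.Int);
              cmd.Parameters.Add("@CustomerName", SqlDbType.NVarChar, 50);
              cmd.Parameters.Add("@OrderDate", SqlDbType.DateTime);

              conn.Open();
              cmd.Prepare();   // one parse/plan on the server, reused below

              foreach (OrderRecord item in items)
              {
                  cmd.Parameters["@CustomerId"].Value = item.CustomerId;
                  cmd.Parameters["@CustomerName"].Value = item.CustomerName;
                  cmd.Parameters["@OrderDate"].Value = item.OrderDate;
                  cmd.ExecuteNonQuery();
              }
          }
      }
  }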

(3) You did not mention what, if anything, you are doing related to SQL 
transactions.  If you are doing things one record at a time without explicitly 
using transactions, there is an implicit "start transaction / end transaction" 
around each insert.  If you are starting a transaction, inserting all 22,000 
records, and then committing, you are putting a burden on the server as it 
needs to be able to roll back.  (But that's the only way to make 100% sure that 
you don't add some but not all of the records.)  Optimal performance is 
typically achieved, I think, by using a separate transaction around 
medium-sized (perhaps 200-500 or so) groups of inserts.
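
A rough sketch of that batching, again with the invented names from earlier 
(300 is just a number somewhere in that range):
  using System;
  using System.Collections.Generic;
  using System.Data;
  using System.Data.SqlClient;

  static class BatchedInsertSketch
  {
      // Commit every few hundred inserts: the server never has to hold a
      // 22,000-row rollback log, but you also avoid 22,000 one-insert
      // transactions.
      static void InsertInBatches(string connectionString,
                                  List<OrderRecord> items)
      {
          const int BatchSize = 300;

          using (SqlConnection conn = new SqlConnection(connectionString))
          {
              conn.Open();
              SqlTransaction tran = conn.BeginTransaction();
              SqlCommand cmd =
                  new SqlCommand(OrderRecord.GetSQL(), conn, tran);
              cmd.Parameters.Add("@CustomerId", SqlDbType.Int);
              cmd.Parameters.Add("@CustomerName", SqlDbType.NVarChar, 50);
              cmd.Parameters.Add("@OrderDate", SqlDbType.DateTime);

              int inBatch = 0;
              foreach (OrderRecord item in items)
              {
                  cmd.Parameters["@CustomerId"].Value = item.CustomerId;
                  cmd.Parameters["@CustomerName"].Value = item.CustomerName;
                  cmd.Parameters["@OrderDate"].Value = item.OrderDate;
                  cmd.ExecuteNonQuery();

                  if (++inBatch == BatchSize)
                  {
                      tran.Commit();                   // finish this group
                      tran = conn.BeginTransaction();  // start the next one
                      cmd.Transaction = tran;
                      inBatch = 0;
                  }
              }
              tran.Commit();   // commit the final, possibly partial, group
          }
      }
  }

(That combines naturally with the prepared statement from 2, of course.)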

(4) If you're using .Net 2.0, look at the SqlBulkCopy class; I haven't looked 
at it carefully, so I don't know if it's useful or not for your situation.
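
For what it's worth, the basic shape of the code would be roughly this (a 
sketch only, again using the invented OrderRecord and table name):
  using System;
  using System.Collections.Generic;
  using System.Data;
  using System.Data.SqlClient;

  static class BulkCopySketch
  {
      // Stage the in-memory items in a DataTable whose columns match the
      // target table, then let SqlBulkCopy stream them to the server.
      static void CopyAll(string connectionString, List<OrderRecord> items)
      {
          DataTable staging = new DataTable();
          staging.Columns.Add("CustomerId", typeof(int));
          staging.Columns.Add("CustomerName", typeof(string));
          staging.Columns.Add("OrderDate", typeof(DateTime));
          foreach (OrderRecord item in items)
              staging.Rows.Add(item.CustomerId, item.CustomerName,
                               item.OrderDate);

          using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
          {
              bulk.DestinationTableName = "Orders";
              bulk.BatchSize = 500;        // rows sent per batch
              bulk.WriteToServer(staging);
          }
      }
  }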

(5) If you are working with SQL Server, you might achieve an enormous speedup 
by writing a text file and using the "bulk insert" statement to suck the data 
in all at once.  (Even if you need to do validation on the contents of the data 
file, perhaps you could pass the raw input file to "bulk insert", write the 
results to a temporary table, then use SQL to do the validation.  For example, 
if the file has an initial header that gives the sum of all the values in one 
column, you could have "bulk insert" skip that record, use "select 
sum(totalcol) from #temptbl" to get the sum, and read (only) the header in your 
program to check against that result.  It would take an incredibly small amount 
of time for such a SQL statement to run against a temp table.)

Unfortunately, using "bulk insert" is often hard to set up in a locked-down 
production environment: you need either to have the text file exist on the 
server (even though you probably should not have a way to write files on the 
SQL Server machine) or to have both the server and every machine running the 
app you're writing able to access the same file over the network (even though 
the SQL Server process would normally not need network access).
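
If you can arrange the file access, the mechanics themselves are simple.  A 
rough sketch (the UNC path, table, and columns are invented; it also assumes 
no field value contains a tab or a line break):
  using System;
  using System.Collections.Generic;
  using System.Data.SqlClient;
  using System.IO;
  using System.Text;

  static class BulkInsertFileSketch
  {
      static void LoadViaBulkInsert(string connectionString,
                                    List<OrderRecord> items)
      {
          const string path = @"\\fileserver\share\orders.txt";

          // Tab-separated fields, CRLF row ends -- the BULK INSERT defaults.
          StringBuilder sb = new StringBuilder(65536);   // pre-sized
          foreach (OrderRecord item in items)
          {
              sb.Append(item.CustomerId).Append('\t');
              sb.Append(item.CustomerName).Append('\t');
              sb.Append(item.OrderDate.ToString("yyyy-MM-ddTHH:mm:ss"));
              sb.Append("\r\n");
          }
          File.WriteAllText(path, sb.ToString());

          using (SqlConnection conn = new SqlConnection(connectionString))
          using (SqlCommand cmd = new SqlCommand(
              "BULK INSERT Orders FROM '" + path + "'", conn))
          {
              conn.Open();
              cmd.ExecuteNonQuery();
          }
      }
  }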

(6) If you're using SQL Server, you can create an XML document (with as little 
XML overhead as possible) of the form
  <root>
    <r f1="value" f2="value" f3="value" .../>
    <r f1="value" f2="value" f3="value" .../>
    (21,998 more like that)
  </root>
and use sp_xml_preparedocument and OPENXML on the server to "shred" the XML 
document into a table variable returned from a table-valued function that 
receives the result of sp_xml_preparedocument as a parameter.  Note that the 
names of the attributes in the XML do not need to be the actual column names -- 
one-letter names (a b c etc) would be even better than the f1 f2 f3 etc in that 
sample, to shorten the XML.  You can write a stored proc similar to this code, 
or use a multi-statement SQL command like
    declare @doc int
    exec sp_xml_preparedocument @doc output, @xml
    insert yourtable(f1, f2, f3...) select * from tablefn(@doc)
    exec sp_xml_removedocument @doc

Then you pass the @xml value using a parameter of type TEXT (with a large 
enough Size).  You can break the 22,000 records into chunks if you want to and 
run that repeatedly with the XML for a few hundred records at a time.  (You 
can't call the XML stored procs from within a function, so the function has to 
take the document id rather than the XML text as parameter.)

Warning -- if you use that technique, you have to use XML character entities 
(&amp; &lt; &gt; &quot;) to replace the & < > " characters in any text columns.

In both cases 5 and 6, don't neglect to set the Capacity of the StringBuilder 
you use to hold the text/xml document you create to a moderately large value 
(e.g. 65536) to avoid doing more re-allocations than are needed as more text is 
added.  (Don't even _dream_ of using + to build the text!)
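
To tie 6 together, here's a rough client-side sketch: it builds the minimal 
XML in a pre-sized StringBuilder, escapes the special characters, and sends 
the result as a TEXT parameter.  Note that instead of a table-valued function 
it shreds the XML directly with OPENXML ... WITH inside the multi-statement 
command; the attribute, table, and column names are placeholders as before:
  using System;
  using System.Collections.Generic;
  using System.Data;
  using System.Data.SqlClient;
  using System.Text;

  static class XmlInsertSketch
  {
      static void InsertViaXml(string connectionString,
                               List<OrderRecord> items)
      {
          // Pre-sized StringBuilder; never build this with string "+".
          StringBuilder xml = new StringBuilder(65536);
          xml.Append("<root>");
          foreach (OrderRecord item in items)
          {
              xml.Append("<r a=\"").Append(item.CustomerId);
              xml.Append("\" b=\"").Append(Escape(item.CustomerName));
              xml.Append("\" c=\"");
              xml.Append(item.OrderDate.ToString("yyyy-MM-ddTHH:mm:ss"));
              xml.Append("\"/>");
          }
          xml.Append("</root>");
          string xmlText = xml.ToString();

          string sql =
              "declare @doc int\n" +
              "exec sp_xml_preparedocument @doc output, @xml\n" +
              "insert Orders (CustomerId, CustomerName, OrderDate)\n" +
              "  select CustomerId, CustomerName, OrderDate\n" +
              "  from OPENXML(@doc, '/root/r')\n" +
              "  with (CustomerId int '@a', CustomerName nvarchar(50) '@b',\n" +
              "        OrderDate datetime '@c')\n" +
              "exec sp_xml_removedocument @doc";

          using (SqlConnection conn = new SqlConnection(connectionString))
          using (SqlCommand cmd = new SqlCommand(sql, conn))
          {
              // TEXT parameter with a big enough Size, as described above.
              cmd.Parameters.Add("@xml", SqlDbType.Text, xmlText.Length)
                  .Value = xmlText;
              conn.Open();
              cmd.ExecuteNonQuery();
          }
      }

      // XML attribute values must have & < > " replaced by entities.
      static string Escape(string s)
      {
          return s.Replace("&", "&amp;").Replace("<", "&lt;")
                  .Replace(">", "&gt;").Replace("\"", "&quot;");
      }
  }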

There is an oft-quoted thought that is hard to attribute (it's been said by 
many, though it is most often credited to Donald Knuth):
     Premature optimization is the root of all evil
and optimizing in-memory operations (like class vs struct) when your 
database actions are what's really slow is "premature" (vs optimizing the 
database work).

Good luck!!!

At 12:45 AM 1/13/2006, Girish Jain wrote:
>Hi,
>
>Thanks to all of you for your valuable inputs. I did some performance
>testing and the results are surprising.
>
>The new type holds all the data to be inserted into the database (in the
>local class variables) and has a property named SQL which returns the
>INSERT SQL statement. By iterating over the items of the collection using a
>for-each loop I am inserting the records into the database. For reading the
>lines from the file I am using regular expressions.
>
>When I made the new type a value type, with some 22,000 records it took
>approx. 25 seconds less time. Over several tests, this saving ranged between
>23 and 25 seconds. Therefore, with a value type the performance is really good.
>
>Now, if somebody can throw some light on this AND highlight the criteria
>(not generic ones) that I should always be considering for value types or
>reference types with regard to **memory** usage, because I don't want to
>decide the type by doing this performance testing every time.
>
>
>Regards,
>Girish Jain
>[snip]


J. Merrill / Analytical Software Corp
