New topic: Achieving the best speed possible?
<http://forums.realsoftware.com/viewtopic.php?t=47072>

Zack.H | Posted: Mon Feb 25, 2013 9:25 pm

I originally had this posted in the Web section, since what I'm dealing with is a web application, but this portion of the application isn't really web related. Assuming most of the real hardcore RB guys don't check the Web section, I decided to move it here.

My issue is that I'm trying to achieve the best speed possible while processing 700,000+ lines of data. The way I "process" this data: I first read it all into a string, then I generate some unique IDs. Those IDs currently consist of MD5Digest values; the amount varies. There are 3 static ones, plus additional ones based on user input. With the sample test data I have, it generates 6 MD5 values in total, which are meant to be unique every time.

After generating those IDs, it stores everything into a PostgreSQL database. I was using SQLite previously and it seemed to slow things down dramatically; I've been told that PostgreSQL would probably be the better/faster solution to handle this amount of data.

Now, these 700,000 records will be used over and over again, and need to be re-processed every single time they're used, to generate new unique IDs for the user's new specifications. So it's really important that there isn't even the slightest delay while processing.

What I'm currently looking into is ANY alternative string handling/hashing plugin or modification that could make this process faster. My only ideas so far are plugins advertised for faster string processing/handling, and the MBS plugin, in the hope that it processes the MD5 digest faster than the internal implementation.

Doing the math on this: if I had even a 500 millisecond delay between records, the total processing time becomes 500 ms times 700,000. What I'm really trying to reach is processing the 700k records in at most 1-10 minutes, which I'm assuming will take some extreme code revamping/reworking on top of external plugins. That'd be about 100-300 milliseconds for each record to complete processing in (my math is probably way off here).
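For scale, the timing arithmetic above works out as follows:

  700,000 records in 10 minutes = 600 s / 700,000 ≈ 0.86 ms per record
  700,000 records in 1 minute   =  60 s / 700,000 ≈ 86 µs per record

So the target budget is well under a millisecond per record, not 100-300 ms: at 500 ms per record the run would take roughly 4 days, and even 100 ms per record works out to about 19.5 hours.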
So if anyone has any suggestions on how to optimize this further, or any paid plugins proven to perform faster (please provide some benchmarks/test results before I purchase any of the products), it would all be greatly appreciated. I'm open to just about any suggestion, as I've got a deadline to meet on this project and I really don't want to start porting my processing code to the C++ side of things; that seems like it would take way too long now that I've got an implementation here that works and is just slow.

The code I'm working with is as follows (it's been modified a lot since the original implementation to improve the speed of its processing, but it's still not fast enough; it takes 1hr+ to process those previously mentioned 700,000 records):

GenerateProgress.Visible = True
GeneratingText.Visible = True

Dim F As FolderItem = GetFolderItem("./Lists/" + File)
Dim T As TextInputStream
Dim Data As String
Dim Lines() As String
Dim Emails() As String
Dim dbFile As FolderItem
Dim ImageHeaders() As String
Dim ImageUrl As String
Dim ImageId() As String

Dim R As Random = New Random
R.RandomizeSeed()
Dim Rand As String = Str(R.InRange(0, 9000))

Dim db As New PostgreSQLDatabase
db.Host = "127.0.0.1"
db.Port = 5432
db.DatabaseName = "cmpsystem"
db.Username = "cmpsys"
db.Password = "test22"

If db.Connect() Then
  T = TextInputStream.Open(F)
  Data = T.ReadAll()
  Lines = Data.SplitB(EndOfLine.Windows)

  // Collect every column that looks like an email address.
  For Each Row As String In Lines
    Dim Columns() As String
    Columns = Row.SplitB(",")
    For Each Column As String In Columns
      If Column.InStrB("@") > 0 Then
        Emails.Append(Column)
      End If
    Next
  Next

  // Build the Img0,Img1,... header fragment.
  For i As Integer = 0 To Images.Ubound()
    ImageHeaders.Append("Img")
    ImageHeaders.Append(Str(i))
    ImageHeaders.Append(",")
  Next

  Dim O As FolderItem = GetFolderItem("./Lists/" + File + "_Out.csv")
  Dim Tout As TextOutputStream
  Tout = TextOutputStream.Create(O)

  Tout.Write("clickid,unsubid,openid,")
  Dim ImageHeader As String = Join(ImageHeaders, "")
  Tout.Write(ImageHeader)
  Tout.WriteLine("*To,")

  Dim Lists() As String

  db.SQLExecute("BEGIN;")

  For Each Email As String In Emails
    ImageId = Array("")

    Dim Md5ClickId As MD5Digest = New MD5Digest
    Dim Md5UnsubId As MD5Digest = New MD5Digest
    Dim Md5OpenId As MD5Digest = New MD5Digest

    For Each Image As String In Images
      Dim Md5ImgId As MD5Digest = New MD5Digest
      Md5ImgId.Process Email
      Md5ImgId.Process Rand
      Md5ImgId.Process Image

      Dim ImageIdHex As String = EncodeHex(Md5ImgId.Value)
      ImageId.Append(ImageIdHex)
      ImageId.Append(",")

      Dim ImageIdQuery() As String
      ImageIdQuery.Append("INSERT INTO cmpsystem.urls VALUES('")
      ImageIdQuery.Append(ImageIdHex)
      ImageIdQuery.Append("','")
      ImageIdQuery.Append(Image)
      ImageIdQuery.Append("','")
      ImageIdQuery.Append(Email)
      ImageIdQuery.Append("','")
      ImageIdQuery.Append("img');")
      Dim ImageQuery As String = Join(ImageIdQuery, "")
      db.SQLExecute(ImageQuery)

      Md5ImgId = Nil
    Next

    Md5ClickId.Process Email
    Md5ClickId.Process Rand
    Md5ClickId.Process ClickUrl

    Md5UnsubId.Process Email
    Md5UnsubId.Process Rand
    Md5UnsubId.Process UnsubUrl

    Md5OpenId.Process Email
    Md5OpenId.Process Rand

    Dim Md5ClickIdHex As String = EncodeHex(Md5ClickId.Value)
    Dim Md5UnsubIdHex As String = EncodeHex(Md5UnsubId.Value)
    Dim Md5OpenIdHex As String = EncodeHex(Md5OpenId.Value)

    If RandomizeLists.Value = True Then
      //Lists.Append(Md5ClickIdHex.UpperCase + "," + Md5UnsubIdHex + "," + Md5OpenIdHex + "," + Join(ImageId, "") + Email + ",")
    Else
      Tout.Write(Md5ClickIdHex)
      Tout.Write(",")
      Tout.Write(Md5UnsubIdHex)
      Tout.Write(",")
      Tout.Write(Md5OpenIdHex)
      Tout.Write(",")
      Dim ImgId As String = Join(ImageId, "")
      Tout.Write(ImgId)
      Tout.Write(Email)
      Tout.WriteLine(",")
    End If

    Dim ClickIdQuery() As String
    ClickIdQuery.Append("INSERT INTO cmpsystem.urls VALUES('")
    ClickIdQuery.Append(Md5ClickIdHex)
    ClickIdQuery.Append("','")
    ClickIdQuery.Append(ClickUrl)
    ClickIdQuery.Append("','")
    ClickIdQuery.Append(Email)
    ClickIdQuery.Append("','click');")
    Dim ClickQuery As String = Join(ClickIdQuery, "")
    db.SQLExecute(ClickQuery)

    Dim UnsubIdQuery() As String
    UnsubIdQuery.Append("INSERT INTO cmpsystem.urls VALUES('")
    UnsubIdQuery.Append(Md5UnsubIdHex)
    UnsubIdQuery.Append("','")
    UnsubIdQuery.Append(UnsubUrl)
    UnsubIdQuery.Append("','")
    UnsubIdQuery.Append(Email)
    UnsubIdQuery.Append("','unsub');")
    Dim UnsubQuery As String = Join(UnsubIdQuery, "")
    db.SQLExecute(UnsubQuery)

    Dim OpenIdQuery() As String
    OpenIdQuery.Append("INSERT INTO cmpsystem.urls VALUES('")
    OpenIdQuery.Append(Md5OpenIdHex)
    OpenIdQuery.Append("','")
    OpenIdQuery.Append("open")
    OpenIdQuery.Append("','")
    OpenIdQuery.Append(Email)
    OpenIdQuery.Append("','open');")
    Dim OpenQuery As String = Join(OpenIdQuery, "")
    db.SQLExecute(OpenQuery)

    Md5ClickId = Nil
    Md5UnsubId = Nil
    Md5OpenId = Nil
  Next

  db.SQLExecute("COMMIT;")

  If RandomizeLists.Value = True Then
    Lists.Shuffle()
    For Each List As String In Lists
      Tout.WriteLine(List)
    Next
  End If
Else
  MsgBox db.ErrorMessage
End If
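One thing worth noting about the code above: each record issues several SQLExecute calls whose SQL is rebuilt by string concatenation every time. A prepared statement, created once before the loop, avoids re-parsing the statement per row and handles quoting safely. A minimal sketch, assuming a Real Studio version that supports PostgreSQLPreparedStatement (the column layout is taken from the INSERTs above):

// Prepare once, before the For Each Email loop.
// $1..$4 are PostgreSQL-style placeholders.
Dim ps As PostgreSQLPreparedStatement
ps = PostgreSQLPreparedStatement(db.Prepare("INSERT INTO cmpsystem.urls VALUES($1, $2, $3, $4)"))

// Inside the loop, one call per row replaces each hand-built query:
ps.SQLExecute(ImageIdHex, Image, Email, "img")
ps.SQLExecute(Md5ClickIdHex, ClickUrl, Email, "click")
ps.SQLExecute(Md5UnsubIdHex, UnsubUrl, Email, "unsub")
ps.SQLExecute(Md5OpenIdHex, "open", Email, "open")

The existing BEGIN/COMMIT wrapping around the loop still applies, and matters just as much with prepared statements.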
//db.SQLExecute("INSERT INTO cmpsystem.urls VALUES('"+EncodeHex(Md5ClickId.Value)+"','"+ClickUrl+"','"+Email+"','click');") ClickIDQuery.Append("INSERT INTO cmpsystem.urls VALUES('") ClickIDQuery.Append( MD5ClickIDHex ) ClickIDQuery.Append("','") ClickIDQuery.Append( ClickUrl ) ClickIDQuery.Append("','") ClickIDQuery.Append( Email ) ClickIDQuery.Append("','click');") Dim ClickQuery As String = Join(ClickIdQuery,"") db.SQLExecute( ClickQuery ) Dim UnsubIDQuery() As String //db.SQLExecute("INSERT INTO cmpsystem.urls VALUES('"+EncodeHex(Md5UnsubId.Value)+"','"+UnsubUrl+"','"+Email+"','unsub');") UnsubIDQuery.Append("INSERT INTO cmpsystem.urls VALUES('") UnsubIDQuery.Append( MD5UnsubIDHex ) UnsubIDQuery.Append( "','") UnsubIDQuery.Append( UnsubUrl ) UnsubIDQuery.Append( "','") UnsubIDQuery.Append( Email ) UnsubIDQuery.Append("','unsub');") Dim UnsubQuery As String = Join(UnsubIDQuery,"") db.SQLExecute( UnsubQuery ) Dim OpenIDQuery() As String //db.SQLExecute("INSERT INTO cmpsystem.urls VALUES('"+EncodeHex(Md5OpenId.Value)+"','open','"+Email+"','open');") OpenIDQuery.Append("INSERT INTO cmpsystem.urls VALUES('") OpenIDQuery.Append( MD5OpenIdHex ) OpenIDQuery.Append( "','") OpenIDQuery.Append("open") OpenIDQuery.Append("','") OpenIDQuery.Append(Email) OpenIDQuery.Append("','open');") Dim OpenQuery As String = Join(OpenIdQuery, "") db.SQLExecute( OpenQuery ) Md5ClickId = Nil Md5UnsubId = Nil Md5OpenId = Nil Next db.SQLExecute("COMMIT;") if RandomizeLists.Value = True Then Lists.Shuffle() for each List As String in Lists Tout.WriteLine(List) next end else MsgBox db.ErrorMessage end Top taylor-design Post subject: Re: Achieving the best speed possible?Posted: Mon Feb 25, 2013 10:07 pm Joined: Wed Mar 22, 2006 11:15 am Posts: 594 Location: Southern California What is the purpose of all the MD5 ID's you are generating? They include a random string generated in this function which you do not appear to be storing for later use (unless I missed something), so it can't be for checking data integrity at a later time. _________________ Daniel L. Taylor Custom Controls for Real Studio WE! Visit: http://www.webcustomcontrols.com/ Top Zack.H Post subject: Re: Achieving the best speed possible?Posted: Mon Feb 25, 2013 10:19 pm Joined: Sun Jun 15, 2008 1:50 am Posts: 79 taylor-design wrote:What is the purpose of all the MD5 ID's you are generating? They include a random string generated in this function which you do not appear to be storing for later use (unless I missed something), so it can't be for checking data integrity at a later time. It's not being used as a traditional MD5 would be for data integrity verification, I'm just aware that there's less of a chance of collision with md5 compared to something like crc32. But what it's used for is unique ID's to reference the stored data by externally without supplying the requester with the data contained as it's only meant to be used internally within the app it's self once the record is processed where as the Unique ID is supplied to others. Which is why it's stored in a database with additional information in order to have that additional information referenced by it, it's also stored in the file which the next application written in C++ that reads that file then uses those MD5 ID's for it's own internal processing. 
Anyway, regardless: if you'd like to suggest a different method of generating unique IDs that you believe would be faster, with less delay per record (essentially making the whole process faster in general), then please do, as long as it's efficient and has a low probability of ever colliding. It's very important that these values are unique.

Update: Everything else aside, this thread actually belongs in the DB section if anywhere else, unless I want faster than the 15-second processing speed for reading/generating/writing the records, because that's what it currently accomplishes without my DB queries.

taylor-design | Posted: Mon Feb 25, 2013 11:04 pm

The performance of MD5 depends entirely on what you feed it. If I feed it 500k of data, it takes roughly 1,000 microseconds to generate its output. If I feed it 70k, about 300 microseconds.

You can generate random IDs using an empty REALSQLDatabase. Here's the sample function:

Function GUID() As String
  Dim db As New REALSQLDatabase
  If db.Connect Then
    Return db.SQLSelect("select hex(randomblob(4)) " _
      + "|| '-' || hex(randomblob(2)) " _
      + "|| '-' || '4' || substr(hex(randomblob(2)), 2) " _
      + "|| '-' || substr('AB89', 1 + (abs(random()) % 4), 1) " _
      + "|| substr(hex(randomblob(2)), 2) " _
      + "|| '-' || hex(randomblob(6)) AS GUID").Field("GUID")
  End If
  Return "ERROR"
End Function

oleman108 posted this in this thread: http://forums.realsoftware.com/viewtopic.php?f=1&t=44203&hilit=+UUID. The chance of an ID collision is extremely low, lower than with the way you're using MD5.

In your case I would store the connected REALSQLDatabase instead of creating it on each call. When I do this I can generate unique IDs in roughly 200 microseconds. This is faster than my own UUID module (http://taylor-design.com/downloads/uuid.zip), and I've started using it for this reason. If these are real emails with real images of any appreciable count or size, this should be much faster than using MD5.
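The "store the connected REALSQLDatabase" variant could look something like the sketch below. The Static-based caching is one possible way to structure it, not part of the original post; the SELECT is the same one shown above:

Function GUID() As String
  // Keep one connected in-memory database across calls,
  // so the Connect cost is paid only once.
  Static db As REALSQLDatabase
  If db = Nil Then
    db = New REALSQLDatabase  // no DatabaseFile assigned: purely in-memory
    If Not db.Connect Then
      db = Nil
      Return "ERROR"
    End If
  End If
  Return db.SQLSelect("select hex(randomblob(4)) " _
    + "|| '-' || hex(randomblob(2)) " _
    + "|| '-' || '4' || substr(hex(randomblob(2)), 2) " _
    + "|| '-' || substr('AB89', 1 + (abs(random()) % 4), 1) " _
    + "|| substr(hex(randomblob(2)), 2) " _
    + "|| '-' || hex(randomblob(6)) AS GUID").Field("GUID")
End Function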
Zack.H | Posted: Mon Feb 25, 2013 11:11 pm

taylor-design wrote: [...] If these are real emails with real images of any appreciable count or size, this should be much faster than using MD5.

I don't know what the images or their size have to do with the MD5 functionality, but I do appreciate the suggestion; that actually looks like a WAY better implementation than what I had before. Now to solve my 59-minute-long SQL query. -_-

taylor-design | Posted: Mon Feb 25, 2013 11:16 pm

Reading your two messages, it sounds like the main problem is the SQL. What happens if you modify the method so that it only calls SQLExecute once per loop? Append all your queries into a single variable and issue one SQLExecute call at the end of the loop. If possible, test this on a subset of the data, since I have no idea whether it will speed up or slow down the time (which stands at 1hr).

Edit: my hunch is that it will speed things up. I can't directly test your code sample, so I can't say for sure, but I have seen this speed things up in the past, and if it improves performance you should see it with this test. If so, at that point I would redesign the method to send an arbitrary number of inserts at once (i.e. controlled by a constant) and test to find the optimum number. You may find that sending all the inserts for 100 records, or 1,000, is far faster than multiple SQLExecute calls for each record.

Another option might be to write out a file specifically for import into your database. Database engines typically have functions for importing text files into tables, and they are optimized for quickly importing massive record counts.
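A sketch of the batch-size idea, for illustration; the kBatchSize constant and the pending() array are hypothetical names, and the optimum size is exactly what would need testing. The query variables are the ones built in the code posted earlier, each already ending in ';':

Const kBatchSize = 1000  // tune this: 100 vs 1,000 can behave very differently

Dim pending() As String

// Inside the per-email loop, collect queries instead of executing each one:
pending.Append(ImageQuery)
pending.Append(ClickQuery)
pending.Append(UnsubQuery)
pending.Append(OpenQuery)

If pending.Ubound + 1 >= kBatchSize Then
  db.SQLExecute(Join(pending, ""))  // one round trip for the whole batch
  ReDim pending(-1)                 // empty the array
End If

// After the loop, flush whatever is left before the COMMIT:
If pending.Ubound >= 0 Then
  db.SQLExecute(Join(pending, ""))
End If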
taylor-design | Posted: Mon Feb 25, 2013 11:21 pm

Zack.H wrote: I don't know what the images or their size have to do with the MD5 functionality [...]

I assumed that in...

For Each Image As String In Images
  Dim Md5ImgId As MD5Digest = New MD5Digest
  Md5ImgId.Process Email
  Md5ImgId.Process Rand
  Md5ImgId.Process Image

...Image actually holds raw data for an image. Again, total data size is what matters for MD5. It's fast, but the more data, the more time it takes. So a couple KB of data probably won't take any longer than getting the ID from REALSQLDatabase, but with a couple hundred KB, all of a sudden the other option is considerably faster. Of course, this isn't the main issue at the moment. Try the SQL test I suggested above.

Zack.H | Posted: Mon Feb 25, 2013 11:30 pm

Yeah, that's not the case; I should've been more specific on that point. It's actually just the URL of the image that's being processed; it's later referenced by this ID and a bit of PHP, i.e.:

<?php
$url = get_url_byid( RANDOMID_HERE );
header("Location: $url");
?>

I'll go ahead and test the method you described, and I'm going to integrate the unique ID generation system you mentioned. I did notice that's specifically an SQLite thing, though, whereas I've just migrated my data to Postgres in the hope it would be faster: the previous SQLite queries were actually way slower than the current Postgres ones (where it currently takes 1hr to process the 700,000 records with those queries, with SQLite it seemed to take 5+ hours). And as said previously, it takes at most 10 seconds to process the 700,000 records without any DB insertion statements.

Edit: As for writing the data out to a file for importing, I'm actually quite new to Postgres myself; I just know it can handle massive amounts of data with no trouble. Whether it can import from a flat file using some kind of LOAD TEXT or other method like MySQL has, I'm not sure, but I'll do the research on that once I've done some testing with plain queries first.

taylor-design | Posted: Mon Feb 25, 2013 11:38 pm

Zack.H wrote: I'll go ahead and test the method you described, and I'm going to integrate the unique ID generation system you mentioned...

I would test one at a time. Do the SQL first.

Quote: I did notice that's specifically an SQLite thing, though [...]

When you test this part, just create/use a REALSQLDatabase for the GUID generation. As you can see from the sample function, the REALSQLDatabase doesn't have to have any tables or be set up with a file on disk. Don't migrate back from Postgres if it's faster; the two don't have to have anything to do with each other. I wouldn't recommend trying to get the GUID from Postgres, simply because the connection overhead will likely make it slower than REALSQLDatabase for the GUID generation.
Zack.H | Posted: Tue Feb 26, 2013 12:26 am

taylor-design wrote: I would test one at a time. Do the SQL first. [...]

I planned on doing it one at a time. And I wasn't thinking of using Postgres for the UID generation; I was just curious whether you'd noticed I was using it and not SQLite.

Anyway, I put the queries together into a single one and tested it out. Result: 30k records every 5 minutes or so by my timing, which is still slow; I'm trying to get the full 700k in at most 1-5 minutes, which I know is possible. I guess I have some database optimization to do.

Another idea I had: a buddy of mine who ran an MD5 collision database in PHP decided that storing 2 million records in a MySQL DB just wasn't worth it, so he came up with the concept of storing each of his plaintexts for the MD5s in a text file named (md5hash).txt. When someone submitted an MD5, all it had to do was open the single file with that already-determined name. Keep in mind this was when we were both about 16, so I didn't know much about DB optimization and neither did he. For building portable, easy-to-install code that seems like a viable solution at this point, just not very practical. Have you ever worked with Postgres? Do you have any additional suggestions?

taylor-design | Posted: Tue Feb 26, 2013 12:45 am

Quote: Anyway, I put the queries together into a single one and tested it out. Result: 30k records every 5 minutes or so [...]

So you tested a single SQLExecute per loop? Is this faster or slower than processing the same subset of data the old way? (It sounds slower, but your measurement sounds like an estimate, so I don't know.)

Quote: Have you ever worked with Postgres? Do you have any additional suggestions?

I work with Postgres a lot, but it's rare to have bulk data loads like this that will be repeated under time constraints. Perhaps try COPY: http://www.postgresql.org/docs/9.2/static/sql-copy.html
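A COPY-based load could look something like the sketch below: write the rows out as CSV and let the server ingest the file in one statement. The file path is hypothetical, the column layout mirrors the INSERTs earlier in the thread, and server-side COPY requires that the Postgres server process can read the file (psql's \copy is the client-side equivalent):

// Write one CSV row per record instead of an INSERT statement.
// (Real data containing commas or quotes would need CSV escaping.)
Dim CopyFile As FolderItem = GetFolderItem("/tmp/urls_load.csv")  // hypothetical path
Dim CopyOut As TextOutputStream = TextOutputStream.Create(CopyFile)
CopyOut.WriteLine(ImageIdHex + "," + Image + "," + Email + ",img")
// ... append the click/unsub/open rows the same way, then:
CopyOut.Close

// One statement loads the whole file server-side:
db.SQLExecute("COPY cmpsystem.urls FROM '/tmp/urls_load.csv' WITH (FORMAT csv);")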
