As I understand it, processing one record involves multiple round trips between the Perl process and the database: fetch data from the DB, check whether the data match, and then send a DML command back to the DB.

I am not familiar with MySQL anymore, but if I had to handle all of this DML quickly on Oracle, I would use just one DML statement ("insert all" or "merge" on the Oracle database), so I only do one execute and the DB server does all the work. I just checked online: MySQL has "INSERT ... ON DUPLICATE KEY UPDATE", which is similar to Oracle's MERGE, but I could not find a MySQL statement similar to Oracle's "insert all". You could, however, write a (stored) procedure that does all the DML, couldn't you?
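As an illustration of the stored-procedure route, here is a minimal sketch in MySQL, called from DBI with a single execute per close event. The table and column names are taken from Brandon's description later in the thread; the procedure name, the sonicwall_denied columns, the close_dt IS NULL guard, and the connection details are assumptions, not his actual schema.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=firewall', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time setup: do the lookup, the UPDATE, and the fallback INSERT
# entirely on the server, so the client makes one round trip.
$dbh->do(<<'SQL');
CREATE PROCEDURE log_close_event(
    IN p_close_dt DATETIME,
    IN p_protocol VARCHAR(8),
    IN p_src_ip   VARCHAR(15),
    IN p_src_port INT,
    IN p_dst_ip   VARCHAR(15),
    IN p_dst_port INT)
BEGIN
    -- Close the most recent still-open connection that matches.
    UPDATE sonicwall_connections
       SET close_dt = p_close_dt
     WHERE protocol = p_protocol
       AND source_ip = p_src_ip
       AND source_port = p_src_port
       AND destination_port = p_dst_port
       AND close_dt IS NULL
     ORDER BY id DESC
     LIMIT 1;

    -- Nothing matched: the firewall denied it, so log it separately.
    IF ROW_COUNT() = 0 THEN
        INSERT INTO sonicwall_denied
            (close_dt, protocol, source_ip, source_port,
             destination_ip, destination_port)
        VALUES
            (p_close_dt, p_protocol, p_src_ip, p_src_port,
             p_dst_ip, p_dst_port);
    END IF;
END
SQL

# In the tail loop: one execute per close event, no separate SELECT.
my $close_event = $dbh->prepare('CALL log_close_event(?, ?, ?, ?, ?, ?)');
$close_event->execute('2011-09-21 09:49:00', 'tcp',
                      '192.168.1.50', 51515, '10.0.0.1', 443);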
________________________________
From: Brandon Phelps <bphe...@gls.com>
To: tiger peng <tigerpeng2...@yahoo.com>
Cc: Tim Bunce <tim.bu...@pobox.com>; John Scoles <byter...@hotmail.com>; "dbi-users@perl.org" <dbi-users@perl.org>
Sent: Wednesday, September 21, 2011 9:49 AM
Subject: Re: Tail Module + DBI Module, can't keep up!

What do you mean about the condition check in a script? Not sure I follow.

On 09/21/2011 10:41 AM, tiger peng wrote:
> The first bottleneck is the open/close of a connection for each record, which is a time-consuming operation. Why don't you just use one connection?
> The second one is doing the condition check in the script instead of doing it all within the database server.
>
> Try to fix these two issues. If it is still too slow, we can do more tuning.
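For concreteness, the shape of the first suggestion looks roughly like the sketch below: connect once, prepare once, and only call execute() inside the tail loop. The file path, credentials, column list, and the parse_line() helper are placeholders rather than Brandon's actual code, and File::Tail is assumed to be the Tail module in question.

use strict;
use warnings;
use DBI;
use File::Tail;

# Connect once, outside the loop, instead of once per record.
my $dbh = DBI->connect('dbi:mysql:database=firewall', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

# Prepare once; only execute() runs inside the loop.
my $ins_open = $dbh->prepare(q{
    INSERT INTO sonicwall_connections
        (open_dt, protocol, source_ip, source_port,
         destination_ip, destination_port)
    VALUES (?, ?, ?, ?, ?, ?)
});

my $tail = File::Tail->new(name => '/var/log/sonicwall.log',
                           maxinterval => 1);

while (defined(my $line = $tail->read)) {
    # parse_line() stands in for whatever extracts the fields from a
    # syslog-ng line; it is not shown here.
    my %f = parse_line($line);
    next unless $f{event} eq 'open';
    $ins_open->execute(@f{qw(open_dt protocol source_ip source_port
                             destination_ip destination_port)});
}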
> ________________________________
> *From:* Brandon Phelps <bphe...@gls.com>
> *To:* tiger peng <tigerpeng2...@yahoo.com>
> *Cc:* Tim Bunce <tim.bu...@pobox.com>; John Scoles <byter...@hotmail.com>; "dbi-users@perl.org" <dbi-users@perl.org>
> *Sent:* Wednesday, September 21, 2011 9:04 AM
> *Subject:* Re: Tail Module + DBI Module, can't keep up!
>
> Thanks for the responses.
>
> I would love to be able to output the data to a flat file or something and do bulk inserts daily, however this isn't really feasible for my situation because the database operations depend on other records.
>
> Basically the system I am creating works like this currently:
>
> 1. Syslog-ng accepts syslog messages from a SonicWALL firewall on the network and logs to a file in /var/log.
> 2. A database table called sonicwall_connections, partitioned by the open_dt column, with 1 partition per day.
> 3. A Perl script runs as a daemon, extracting open_dt, close_dt, protocol, source_ip, destination_ip, source_port, destination_port, bytes_sent, and bytes_received.
> 4. The Perl script from (3) constantly reads in the /var/log files using the Tail module. It does the following:
> a. When a connection is opened, it INSERTs into the sonicwall_connections table.
> b. When a connection is closed, it SELECTs from the sonicwall_connections table the last record id that matches the protocol, source_ip, source_port, and destination_port.
> c. If a record exists matching these criteria, it UPDATEs that record's close_dt column with the time the connection was closed.
> d. If a record does not exist, then in our case this means the connection was denied by firewall rules, and we instead INSERT into a different table, sonicwall_denied.
>
> Previously I was doing the entire SELECT prepare and execute in my never-ending while loop that parses the log. I have since changed this per the recommendations here. Now I open my connection to the database using Perl DBI, then I create my prepared statement (outside the while loop), and I simply call execute() within the while loop, so the prepared statement only has to be created once. This seems to have helped a bit, in that the script does not fall behind quite as fast now, however it does still fall behind. Currently, with these modifications, it is 10 AM EST and the script is still working on log entries from yesterday night. That is around 12 hours behind, compared to the 22-23 hours behind it was previously, so this change obviously did help... still not perfect yet though.
>
> As you can see from the procedure above, I cannot really work with a flat file or a non-indexed hot table, because I constantly need to read back in records that were previously inserted. Sometimes the record I need was inserted just seconds before, while other times I need to retrieve records that have been there for days. For example, when a user loads a website in their browser, I get my initial open-connection log entry and do the initial INSERT, but only a second later (depending on the size of the site, obviously) I am ready to UPDATE the record with the close_dt time (when the web request connection was closed). On the other hand, if a user downloads a 10GB file via FTP, or starts a remote desktop session, the connection could potentially be open for hours if not days, so I also need to be able to read records from that long ago, in order to get the record ID to UPDATE when the connection is closed.
>
> Another reason I cannot store records in a temporary table or flat file is that the actual web GUI users most often need the data right away. For example, if a manager sees a huge spike in bandwidth on our fiber connection, they need to be able to instantly open the web page and get the exact details of that connection, such as the source IP address... this way they can determine which employee initiated the download, and to what destination they are connected. This way they can say, "Oh, this employee has a VNC connection opened to their house, that doesn't really seem work related", etc.
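Since MySQL's single-table UPDATE accepts ORDER BY and LIMIT, steps b-d of Brandon's workflow above can be collapsed so that the database does the matching itself and the separate SELECT round trip disappears. The sketch below reuses the prepare-once pattern Brandon describes; the connection details, sample values, sonicwall_denied columns, and the close_dt IS NULL guard are assumptions rather than his actual schema.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=firewall', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

# Prepared once, outside the tail loop.
my $upd_close = $dbh->prepare(q{
    UPDATE sonicwall_connections
       SET close_dt = ?
     WHERE protocol = ?
       AND source_ip = ?
       AND source_port = ?
       AND destination_port = ?
       AND close_dt IS NULL
     ORDER BY id DESC
     LIMIT 1
});

my $ins_denied = $dbh->prepare(q{
    INSERT INTO sonicwall_denied
        (close_dt, protocol, source_ip, source_port,
         destination_ip, destination_port)
    VALUES (?, ?, ?, ?, ?, ?)
});

# Inside the loop, for a close event: one UPDATE instead of
# SELECT-then-UPDATE, falling back to the denied table if no row matched.
my ($close_dt, $proto, $src_ip, $src_port, $dst_ip, $dst_port) =
    ('2011-09-21 10:00:00', 'tcp', '192.168.1.50', 51515, '10.0.0.1', 443);

my $rows = $upd_close->execute($close_dt, $proto, $src_ip, $src_port, $dst_port);
if ($rows == 0) {   # execute() returns "0E0" (true, but numerically 0) when nothing matched
    $ins_denied->execute($close_dt, $proto, $src_ip, $src_port,
                         $dst_ip, $dst_port);
}

That is one execute per event in the common case, and two only for denied connections.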
> On 09/16/2011 12:42 PM, tiger peng wrote:
> > I just finished tuning an application that is almost exactly the reverse of Brandon's: not parsing a text file and then loading it, but extracting the data, parsing/transforming it, and then generating an XML file.
> >
> > The original application fetched and parsed records one by one, and the parsing/transforming code was not implemented efficiently. It took around 40 minutes to pull data from a highly active transaction database, and the DBAs complained about it. I split out the extraction, letting it quickly dump the data to a flat file (~8 minutes), and then parsed/transformed the data and generated the XML file (~3 minutes). It should be possible to speed it up even more by pulling the data into memory in parallel (it is not very big, ~2GB) or by dumping the flat data file on the DB server side.
> >
> > ________________________________
> > *From:* tiger peng <tigerpeng2...@yahoo.com>
> > *To:* Tim Bunce <tim.bu...@pobox.com>; John Scoles <byter...@hotmail.com>
> > *Cc:* "bphe...@gls.com" <bphe...@gls.com>; "dbi-users@perl.org" <dbi-users@perl.org>
> > *Sent:* Friday, September 16, 2011 10:13 AM
> > *Subject:* Re: Tail Module + DBI Module, can't keep up!
> >
> > Tim,
> >
> > I bet that at the time your MySQL did not support partitioning.
> >
> > Now, with partition management available, the job should be easy. Use a small partition, called a hot partition in the DW/ETL field, to receive the new data; since it only ever holds a small chunk of data, the insert/update DML should execute fast (keep the indexes local). When it reaches a threshold, split the hot partition, use the fresh hot partition to receive newer data, and merge the older data into the main partition.
> >
> > Using a text file has its advantages: whenever the DB is down, the parsing of the log files can still run by itself, and when the DB is back, a bulk-load tool can be used to catch up the load quickly.
> >
> > Tiger
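One way to approximate the hot-partition idea in MySQL (5.1 and later) is a RANGE-partitioned table with a small catch-all partition that receives new rows and is periodically split with REORGANIZE PARTITION. The sketch below is only illustrative: the schema is heavily abbreviated, the partition and connection names are made up, and a real version would run the ALTER from a scheduled job.

use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

my $dbh = DBI->connect('dbi:mysql:database=firewall', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

# A range-partitioned table with a small catch-all "hot" partition that
# receives all new rows.  (Schema heavily abbreviated.)
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS sonicwall_connections (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    open_dt  DATETIME NOT NULL,
    close_dt DATETIME NULL,
    protocol VARCHAR(8),
    PRIMARY KEY (id, open_dt)
)
PARTITION BY RANGE (TO_DAYS(open_dt)) (
    PARTITION p20110919 VALUES LESS THAN (TO_DAYS('2011-09-20')),
    PARTITION p_hot     VALUES LESS THAN MAXVALUE
)
SQL

# Periodic maintenance: carve everything older than today out of the hot
# partition into its own partition, leaving a fresh, small p_hot behind.
# (Partition names and range boundaries must keep increasing.)
my $today     = strftime('%Y-%m-%d', localtime);
my $part_name = 'p' . strftime('%Y%m%d', localtime(time - 86_400));

$dbh->do(qq{
    ALTER TABLE sonicwall_connections
    REORGANIZE PARTITION p_hot INTO (
        PARTITION $part_name VALUES LESS THAN (TO_DAYS('$today')),
        PARTITION p_hot      VALUES LESS THAN MAXVALUE
    )
});

REORGANIZE PARTITION rewrites the rows it moves, which is one reason to keep the hot partition small.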
> > ________________________________
> > *From:* Tim Bunce <tim.bu...@pobox.com>
> > *To:* John Scoles <byter...@hotmail.com>
> > *Cc:* tigerpeng2...@yahoo.com; bphe...@gls.com; dbi-users@perl.org
> > *Sent:* Friday, September 16, 2011 4:20 AM
> > *Subject:* Re: Tail Module + DBI Module, can't keep up!
> >
> > This is all fine advice for performance tuning a DBI app and worth doing. But there's always a limit to what can be achieved on the client, so it's worth exploring what can be done on the server side, beyond standard tuning practices.
> >
> > I wrote a high-volume log insertion app using MySQL many years ago. I can't remember the performance figures, but it was fast.
> >
> > A major cost of insertion is maintaining indices. So for maximum insertion speed you want to have no indices. But to query the log records you'll probably want indices.
> >
> > The way I approached this was to have per-hour tables: log_YYYYMMDDHH. The loader streams the records into the table as fast as it can. (From memory, I used a multi-row INSERT statement, no placeholders.) Every time it switched to a new hour, it triggered the building of indices on the previous hour's table. It also triggered recreating per-day views (log_YYYYMMDD) that abstracted the hourly tables.
> >
> > The same technique could be applied at whatever time granularity meets your needs, from table-per-day to table-per-minute.
> >
> > If you can delay the loading of the log data by whatever period you're using (e.g. hourly), then you have the option of preparing the log data as hourly text files and then, when each is complete, using a proprietary bulk-loading tool to load it.
> >
> > Tim.
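In DBI terms, Tim's scheme might look something like the following. This is a sketch only: the table and column names (log_template, open_dt, protocol, source_ip), the batch handling, and the view refresh are invented for illustration, and the template table is assumed to be index-free so each new hourly table starts without indices.

use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

my $dbh = DBI->connect('dbi:mysql:database=firewall', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

my $current_hour = '';

# Flush a batch of already-parsed rows into the current hour's table.
sub flush_batch {
    my @rows = @_;
    return unless @rows;

    my $hour  = strftime('%Y%m%d%H', localtime);
    my $table = "log_$hour";

    if ($hour ne $current_hour) {
        # New hour: create the next table (from an index-free template)
        # and index + republish the hour that just finished.
        $dbh->do("CREATE TABLE IF NOT EXISTS $table LIKE log_template");
        index_and_refresh_view($current_hour) if $current_hour;
        $current_hour = $hour;
    }

    # One multi-row INSERT per batch; values are quoted by the driver,
    # no placeholders.
    my $values = join ',', map {
        '(' . join(',', map { $dbh->quote($_) } @$_) . ')'
    } @rows;
    $dbh->do("INSERT INTO $table (open_dt, protocol, source_ip) VALUES $values");
}

sub index_and_refresh_view {
    my ($hour) = @_;
    my $day = substr $hour, 0, 8;

    # Index the finished hour so it is cheap to query ...
    $dbh->do("ALTER TABLE log_$hour ADD INDEX idx_open_dt (open_dt)");

    # ... and rebuild the per-day view over all hourly tables seen so far.
    my @tables = grep { /^log_$day\d\d$/ }
                 map  { $_->[0] }
                 @{ $dbh->selectall_arrayref('SHOW TABLES') };
    my $union = join ' UNION ALL ', map { "SELECT * FROM $_" } @tables;
    $dbh->do("CREATE OR REPLACE VIEW log_$day AS $union");
}

# Example: flush a two-row batch.
flush_batch(['2011-09-21 10:00:01', 'tcp', '192.168.1.50'],
            ['2011-09-21 10:00:02', 'udp', '192.168.1.51']);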