The first bottleneck is opening and closing a connection for each record, which is a time-consuming operation. Why not just use one connection?
The second is that the condition check is done in the script instead of doing it all within the database server.
Try to fix these two issues. If it is still too slow, we can do more tuning.
________________________________
From: Brandon Phelps<bphe...@gls.com>
To: tiger peng<tigerpeng2...@yahoo.com>
Cc: Tim Bunce<tim.bu...@pobox.com>; John Scoles<byter...@hotmail.com>;
"dbi-users@perl.org"<dbi-users@perl.org>
Sent: Wednesday, September 21, 2011 9:04 AM
Subject: Re: Tail Module + DBI Module, can't keep up!
Thanks for the responses.
I would love to be able to output the data to a flat file or something and do
bulk inserts daily, however this isn't really feasible for my situation because
the database operations depend on other records.
Basically the system I am creating works like this currently:
1. Syslog-ng accepts syslog messages from a SonicWALL firewall on the network
and logs them to a file in /var/log.
2. A Perl script runs as a daemon, extracting open_dt, close_dt, protocol,
source_ip, destination_ip, source_port, destination_port, bytes_sent, and
bytes_received.
3. A database table called sonicwall_connections is partitioned by the open_dt
column, one partition per day.
4. The Perl script from (2) constantly reads the /var/log files using the Tail
module. It does the following:
   a. When a connection is opened, it INSERTs into the sonicwall_connections
table
   b. When a connection is closed, it SELECTs from the sonicwall_connections
table the last record id that matches the protocol, source_ip, source_port, and
destination_port
   c. If a record matching these criteria exists, it UPDATEs that record's
close_dt column with the time the connection was closed
   d. If no such record exists, then in our case this means the connection was
denied by firewall rules, and we instead INSERT into a different table,
sonicwall_denied
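For concreteness, steps (a) through (d) might be sketched with DBI roughly as follows. This is only a sketch: the DSN, credentials, column lists, an `id` primary key, and the handle_event() helper are my assumptions, and the log parsing itself is elided.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# One connection for the daemon's lifetime (DSN and credentials are placeholders).
my $dbh = DBI->connect('dbi:mysql:database=firewall', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 1 });

# Prepare each statement once, outside the tail loop.
my $ins_open = $dbh->prepare(q{
    INSERT INTO sonicwall_connections
        (open_dt, protocol, source_ip, source_port,
         destination_ip, destination_port)
    VALUES (?, ?, ?, ?, ?, ?)
});
my $sel_last = $dbh->prepare(q{
    SELECT id FROM sonicwall_connections
    WHERE protocol = ? AND source_ip = ?
      AND source_port = ? AND destination_port = ?
    ORDER BY id DESC LIMIT 1
});
my $upd_close = $dbh->prepare(q{
    UPDATE sonicwall_connections SET close_dt = ? WHERE id = ?
});
my $ins_denied = $dbh->prepare(q{
    INSERT INTO sonicwall_denied
        (close_dt, protocol, source_ip, source_port, destination_port)
    VALUES (?, ?, ?, ?, ?)
});

# Called once per parsed log line; %r holds the fields extracted by the parser.
sub handle_event {
    my (%r) = @_;
    if ($r{type} eq 'open') {                        # step (a)
        $ins_open->execute(@r{qw(open_dt protocol source_ip source_port
                                 destination_ip destination_port)});
    } else {                                         # steps (b)-(d)
        $sel_last->execute(@r{qw(protocol source_ip source_port
                                 destination_port)});
        my ($id) = $sel_last->fetchrow_array;
        if (defined $id) {
            $upd_close->execute($r{close_dt}, $id);  # step (c): close it
        } else {
            $ins_denied->execute(@r{qw(close_dt protocol source_ip
                                       source_port destination_port)});  # step (d)
        }
    }
}
```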
Previously I was doing the entire SELECT prepare and execute inside my
never-ending while loop that parses the log. I have since changed this per the
recommendations here. Now I open my connection to the database using Perl's
DBI, create my prepared statements outside the while loop, and simply call
execute() within the loop, so each prepared statement only has to be created
once. This seems to have helped a bit, in that the script does not fall behind
quite as fast now, but it does still fall behind. With these modifications it
is currently 10 AM EST and the script is still working on log entries from
yesterday night. That's around 12 hours behind, compared to the 22-23 hours
behind it was previously, so this change obviously did help... still not
perfect yet though.
As you can see from the procedure above I cannot really work with a flat file
or non-indexed hot table because I constantly need to read back in records that
were previously inserted. Sometimes the record I need was inserted just
seconds before, while other times I need to retrieve records that have been
there for days. For example, when a user loads a website in their browser, I
get my initial open connection log entry and I do the initial INSERT, but only
a second later (depending on the size of the site obviously) I am ready to
UPDATE the record with the close_dt time (when the web request connection was
closed). On the other hand if a user downloads a 10GB file via FTP, or starts
a remote desktop session, the connection could potentially be opened for hours
if not days, so I also need to be able to read records from that long ago, in
order to get the record ID to UPDATE when the connection is closed.
Another reason I cannot store records to a temporary table or flat file is that the
actual web GUI users most often need the data right away. For example, if a manager sees
a huge spike in bandwidth on our fiber connection, they need to be able to instantly open
the web page and get the exact details of that connection, such as the source ip
address... this way they can determine what employee has initiated the download or
whatever, and to what destination they are connected. This way they can say, "Oh
this employee has a VNC connection opened to their house, that doesn't really seem work
related", etc.
On 09/16/2011 12:42 PM, tiger peng wrote:
I just finished tuning an application that is almost exactly the reverse of
Brandon's: not parsing a text file and then loading, but extracting the data,
parsing/transforming it, and then generating an XML file.
The original application fetched and parsed records one by one, and the code
for parsing/transforming was not implemented efficiently. It took around 40
minutes to pull data from a highly active transaction database, and the DBAs
complained about it. I split out the extraction, letting it quickly dump the
data to a flat file (~8 minutes), and then parsed/transformed the data and
generated the XML file (~3 minutes). It could be sped up even more by pulling
the data into memory in parallel (it is not very big, ~2GB) or by dumping the
flat data file on the DB server side.
________________________________
*From:* tiger peng<tigerpeng2...@yahoo.com>
*To:* Tim Bunce<tim.bu...@pobox.com>; John Scoles<byter...@hotmail.com>
*Cc:* "bphe...@gls.com"<bphe...@gls.com>;
"dbi-users@perl.org"<dbi-users@perl.org>
*Sent:* Friday, September 16, 2011 10:13 AM
*Subject:* Re: Tail Module + DBI Module, can't keep up!
Tim,
I bet at that time your MySQL did not support partitioning.
Now, with partition management available, the job should be easy.
Use a small partition, called a hot partition in the DW/ETL field, to receive
the new data; since it holds only a small chunk of data, the insert/update DML
should execute fast (keep the indexes local). When it reaches a threshold,
split the hot partition, use the fresh hot partition to receive newer data, and
merge the older data into the main partition.
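In MySQL terms, splitting off a hot partition might look roughly like this. It is a sketch only: the table layout, partition names, and date boundary are made up, and the exact syntax should be checked against the server version in use.

```sql
-- Assumed: sonicwall_connections is RANGE-partitioned on TO_DAYS(open_dt),
-- with a small catch-all partition p_hot at the end receiving new rows.
-- Splitting p_hot carves the filled day out into its own partition and
-- leaves a fresh, empty p_hot to keep receiving inserts.
ALTER TABLE sonicwall_connections
    REORGANIZE PARTITION p_hot INTO (
        PARTITION p20110921 VALUES LESS THAN (TO_DAYS('2011-09-22')),
        PARTITION p_hot     VALUES LESS THAN MAXVALUE
    );
```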
Using a text file has its advantages: whenever the DB is down, the parsing of
log files can still run by itself, and when the DB is back, a bulk-load tool
can be used to catch up on the load quickly.
Tiger
________________________________
*From:* Tim Bunce<tim.bu...@pobox.com>
*To:* John Scoles<byter...@hotmail.com>
*Cc:* tigerpeng2...@yahoo.com; bphe...@gls.com; dbi-users@perl.org
*Sent:* Friday, September 16, 2011 4:20 AM
*Subject:* Re: Tail Module + DBI Module, can't keep up!
This is all fine advice for performance tuning a DBI app and worth doing.
But there's always a limit to what can be achieved on the client.
So it's worth exploring what can be done on the server side, beyond
standard tuning practices.
I wrote a high volume log insertion app using mysql many years ago.
I can't remember the performance figures, but it was fast.
A major cost of insertion is maintaining indices. So for maximum
insertion speed you want to have no indices. But to query the log
records you'll probably want indices.
The way I approached this was to have per-hour tables: log_YYYYMMDDHH
The loader streams the records into the table as fast as it can.
(From memory I used a multi-row INSERT statement, no placeholders.)
Every time it switches to a new hour it triggered the building of
indices on the previous hour's table. It also triggered recreating
per-day views (log_YYYYMMDD) that abstracted the hourly tables.
The same technique could be applied at whatever time granularity meets
your needs, from table-per-day to table-per-minute.
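The multi-row INSERT Tim mentions can be built as a single statement string covering many rows. A minimal sketch, with a made-up column layout; the quoting helper here is deliberately naive and only for illustration, since real code would use $dbh->quote (or placeholders) instead:

```perl
use strict;
use warnings;

# Build one multi-row INSERT for a per-hour table such as log_2011092110.
# NOTE: naive quoting for illustration only; use $dbh->quote in real code.
sub build_bulk_insert {
    my ($hour, $rows) = @_;    # $rows: arrayref of [src_ip, dst_ip, bytes]
    my @tuples = map {
        my @vals = map { my $v = $_; $v =~ s/'/''/g; "'$v'" } @$_;
        '(' . join(', ', @vals) . ')';
    } @$rows;
    return "INSERT INTO log_$hour (source_ip, destination_ip, bytes) VALUES "
         . join(', ', @tuples);
}

my $sql = build_bulk_insert('2011092110',
    [ [ '10.0.0.1', '192.168.1.5', 512  ],
      [ '10.0.0.2', '192.168.1.9', 2048 ] ]);
print "$sql\n";
```

One statement per batch avoids a client-server round trip per row, which is where much of the per-record cost goes.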
If you can delay the loading of the log data by whatever period you're
using (eg hourly) then you have the option of preparing the log data as
hourly text files and then, when each is complete, using a proprietary
bulk-loading tool to load it.
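With MySQL, loading a completed hourly file could look something like this (a sketch; the database name, staging path, and column layout are assumptions, and mysqlimport derives the target table from the file's basename):

```shell
# mysqlimport wraps LOAD DATA INFILE; --local reads the file client-side.
# The target table is taken from the basename, here sonicwall_connections.
mysqlimport --local \
    --fields-terminated-by='\t' \
    --columns='open_dt,close_dt,protocol,source_ip,destination_ip,bytes' \
    firewall /var/log/staging/sonicwall_connections.2011092110.txt
```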
Tim.