Shawn,
Thanks for helping on this. I really appreciate it.
No problem!!
Please post the structures of your "big_table" and your "url_table" (whatever you called them) and I will help you to rewrite step 4 to count how many times a URL appears in the "big_table".
mysql> desc internet_usage;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| uid    | varchar(10) | YES  | MUL | NULL    |       |
| time   | datetime    | YES  |     | NULL    |       |
| ip     | varchar(20) | YES  |     | NULL    |       |
| action | varchar(20) | YES  |     | NULL    |       |
| urlid  | int(11)     | YES  |     | NULL    |       |
| size   | int(11)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
6 rows in set (0.03 sec)
mysql> desc url_table;
+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra          |
+----------+--------------+------+-----+---------+----------------+
| urlid    | mediumint(9) |      | PRI | NULL    | auto_increment |
| url      | text         | YES  | MUL | NULL    |                |
| hits     | mediumint(9) | YES  |     | NULL    |                |
| category | varchar(50)  | YES  |     | NULL    |                |
+----------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)
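For illustration, given these structures, a first (non-incremental) version of the count might look like the sketch below. It assumes hits should simply equal the number of internet_usage rows per urlid, and a MySQL version that supports derived tables; it overwrites whatever is in hits already.

-- Sketch only: recompute hits from scratch for every URL that appears
-- at least once in internet_usage.
UPDATE url_table u
JOIN (SELECT urlid, COUNT(*) AS cnt
      FROM internet_usage
      GROUP BY urlid) t ON t.urlid = u.urlid
SET u.hits = t.cnt;

Note that URLs with no matching rows are not touched by the join, so their old hits values (or NULLs) remain.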
Is this bulk import a process you need to do repeatedly? If so, we need to worry about updating the count column on the next batch import rather than just re-creating it from the new data. That will change the query significantly.
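For example, the "update rather than re-create" version might look like the sketch below, assuming each new batch is first loaded into a staging table (called batch_usage here purely for illustration) before being merged into internet_usage:

-- Sketch only: add this batch's counts onto the existing hits values,
-- treating NULL as zero for URLs that have never been counted before.
-- batch_usage is a hypothetical staging table holding only the new rows.
UPDATE url_table u
JOIN (SELECT urlid, COUNT(*) AS new_hits
      FROM batch_usage
      GROUP BY urlid) b ON b.urlid = u.urlid
SET u.hits = IFNULL(u.hits, 0) + b.new_hits;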
Yes, it will be a repeated process. Actually I have a backlog of 6 months' data to load!
Here's an example of what the data looks like:
mysql> select * from internet_usage limit 5;
+---------+---------------------+----------------+--------------+-------+------+
| uid | time | ip | action | urlid | size |
+---------+---------------------+----------------+--------------+-------+------+
| n58396 | 2004-06-07 21:12:16 | 21.38.25.204 | TCP_TUNNELED | 5999 | 5297 |
| u344584 | 2004-06-07 21:07:12 | 21.33.136.74 | TCP_HIT | 4494 | 438 |
| - | 2004-06-07 21:07:02 | 21.38.92.76 | TCP_NC_MISS | 2134 | 771 |
| u524797 | 2004-06-07 21:03:27 | 21.32.25.41 | TCP_NC_MISS | 260 | 582 |
| - | 2004-06-07 21:09:13 | 21.201.130.240 | TCP_HIT | 3112 | 542 |
+---------+---------------------+----------------+--------------+-------+------+
mysql> select * from url_table limit 5;
+-------+-----------------------------------------------------------------------------------+------+---------------+
| urlid | url | hits | category |
+-------+-----------------------------------------------------------------------------------+------+---------------+
| 1 | http://www.bbc.co.uk/horoscopes/chinesehoroscopes/images/hp-snake.gif | NULL | Entertainment |
| 2 | http://www.call18866.co.uk/images/logo.jpg | NULL | none |
| 3 | http://m2.doubleclick.net/866421/0409_santoku_250.gif | NULL | none |
| 4 | http://lysto1-dc02.ww.ad.ba.com/ | NULL | Travel |
| 5 | http://www.aboutscotland.com/edin/newstreetph/sitview.jpg | NULL | Travel |
+-------+-----------------------------------------------------------------------------------+------+---------------+
5 rows in set (0.00 sec)
One other problem I'm having here is making the rows in internet_usage unique. At the moment I have lots of duplicates, and I was trying to create a temporary table of the distinct rows, but unfortunately I got error 27 (I think this refers to a 2GB file-size limit).
mysql> CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM internet_usage;
ERROR 1030: Got error 27 from table handler
Is there another way of doing this?
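One possible workaround, sketched below, is to skip the temporary table entirely and let a UNIQUE key plus INSERT IGNORE drop the duplicates into a regular table. This is only a sketch: the internet_usage_unique name is an illustration, and whether it avoids error 27 depends on where the 2GB limit actually comes from (the filesystem holding tmpdir versus the data directory).

-- Sketch only: a regular table with the same layout as internet_usage
-- and a UNIQUE key across every column, so duplicate rows are rejected.
CREATE TABLE internet_usage_unique (
  uid    varchar(10),
  time   datetime,
  ip     varchar(20),
  action varchar(20),
  urlid  int(11),
  size   int(11),
  UNIQUE KEY all_cols (uid, time, ip, action, urlid, size)
);

-- INSERT IGNORE silently skips rows that collide with the UNIQUE key,
-- so only one copy of each duplicate row survives.
INSERT IGNORE INTO internet_usage_unique
SELECT uid, time, ip, action, urlid, size
FROM internet_usage;

One caveat: a UNIQUE key does not treat two rows as duplicates when a keyed column is NULL in both, so rows containing NULLs would still need separate handling.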
Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
> Mos forgot to populate the url_id column in your user table. I would use
> his same process but re-arrange it like this:
>
> 1) create table BIG_TABLE....
> 2) load data infile....
> 3) create table URL_TABLE (
>        url_id bigint not null auto_increment,
>        url varchar(25) not null primary key,
>        category ....
>    )
> 4) INSERT IGNORE URL_TABLE (url, category)
>    SELECT url, category
>    FROM BIG_TABLE
> 5) create table USER_TABLE (
>        user_id varchar?(...) primary key,
>        date_time datetime,
>        url_id bigint,
>        size int
>    )
>
> 6) insert USER_TABLE
>    SELECT bt.userID, bt.datetime, u.url_id, bt.size
>    FROM BIG_TABLE bt
>    INNER JOIN URL_TABLE u
>        ON u.url = bt.url
>
> Doing it this way lets the SQL engine handle the conversion of URLs to
> their IDs in the USER_TABLE... see? No scripting required at all!
>
> Yours,
> Shawn Green
> Database Administrator
> Unimin Corporation - Spruce Pine
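For illustration, here is the final join step (step 6 above) written against the actual tables in this thread. This is only a sketch: big_table stands for the raw imported log, and its column names are assumptions, since that table isn't shown here.

-- Sketch only: populate internet_usage by looking up each URL's id.
-- big_table and its columns (uid, time, ip, action, url, size) are
-- assumed names for the raw import table, which isn't shown in the thread.
INSERT INTO internet_usage (uid, time, ip, action, urlid, size)
SELECT bt.uid, bt.time, bt.ip, bt.action, u.urlid, bt.size
FROM big_table bt
INNER JOIN url_table u ON u.url = bt.url;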
Shawn,
Many thanks for your reply. It sounds like this might be quicker than my Perl script, which at the moment parses a 1GB log file and fills the database in 40 minutes (the actual parsing only takes 1.5 minutes). There's one snag I forgot about: the url_table has another column called hits, which is the number of hits for each URL. I'm terrible at SQL and wondered if you might be able to suggest a way of handling this with the method above?
Thanks,
JS.