Thanks for letting me know how the indexes work. I will play with it some more by changing the order of indexes/changing the select statements and see what the optimizer does.
Sai Pullabhotla Phone: (402) 408-5753 Fax: (402) 408-6861 www.jMethods.com On Thu, Dec 25, 2008 at 6:32 PM, <[email protected]> wrote: > > Sigh. > > > Sai, > > You post a question to this mailing list looking for help. > The help you are getting is free so take it with a grain of salt. > But also consider that the people who are taking the time to help you are > seasoned database professionals who have a lot of expertise. > > You are trying to rationalize a bad design. We know that it is a bad design > because you're here asking for help because your application does not > perform well. > > I would suggest that you consider what is being said and then taking it back > to your team so you can fix your design. > > Not having indexes is bad. Fixing this doesn't fix your poor design. It will > improve some performance issues, but you'll not get the most from your > database. > > I understand that you're trying to store your log data in to a database so > that you can take advantage of what a database does best. > > But if you're going to take the time and effort to do this, why don't you do > it right from the start and do the little extra work to make your life > simpler in the long run. > > There's an expression... '... there is more than one way to skin a cat ...'. > This is true when it comes to solving your problems in software. > > If you add a column to your user table a emp_status char(1), you can store > values like 'A' for active, or 'N' for non active. Or if your database > supports binary data types, True or False for active or non-active. > > You can then have your program get the correct user id based on username and > emp_status. You then use this information when you insert the record in to > your job table. This will 'shrink' the width of the columns. (I'm not going > to suggest alternatives like triggers and stored procedures but they also > work well...) > > You can then do the same for your project table. > > This should clean up your table design. > > Getting in to your questions about indexes... > > First, lets take a look at your query: >> select job_id, proj_id, proj_name, proj_folder, submit_user, >> run_user, queue_priority, run_priority, submit_time, start_time, >> end_time, time_taken, job_log, status from dpa_job where start_time>=? >> and start_time<? and status in(?, ?, ?, ?) order by start_time desc >> > > I rewrote it in the following format. > > SELECT job_id, proj_id, proj_name, proj_folder, > submit_user, run_user, queue_priority, > run_priority, submit_time, start_time, end_time, > time_taken, job_log, status > FROM dpa_job > WHERE start_time>=? > AND start_time<? > AND status IN(?, ?, ?, ?) > ORDER BY start_time desc > > The reason you want to do this is that you now have visual cues to look at > your query and see what is going on. It doesn't take much more time and if > you make this a habit, you'll get even faster at spotting problems. > > The SELECT clause shows you the columns that you're displaying. > The FROM clause shows you where the data is coming from. > The WHERE clause indicate what filters are being used to limit the data. > (Its really the WHERE, AND, and OR statements.) > This is where you need to focus your attention when you want to create > indexes that your table will use. > > In this query, you are filtering on only two columns. So if you want to > improve the query's performance, you want an index that is ordered on one or > both of these columns. Since start_time is more unique than status, you'll > want the index to order first on start_time, and then on status. You may > want to not use status in the compound index unless you really have a lot of > different status types. (It looks like you only have a handful of status > types. > > If you have other queries, you'd want to perform the same exercise and see > how the queries filter the datasets from each table and then see if you > already have an index that could be used. If not, then you'd want to add an > index. > > An example... > Suppose you want to find all of the jobs submitted for a given project by a > given user. > > Your WHERE clause would look like: > WHERE proj_id = ? > AND submit_user = ? > > If you want to limit the time stamp range you'd add the following: > AND start_time >=? > AND start_time <=? > > You could then index on proj_id, submit_user, and possibly start_time in > that order. > > (Since you may have an index on start_time, you may want to check to see > which index the optimizer is going to use. It is possible that the optimizer > uses the index on start_time and then filters the data set by the proj_id > and submit_user. This is going to be database dependent.) > > Since you have a database and data, you can use ij to create and drop > indexes and test the queries to see how they perform. > > I would suggest that you look at a good relational database design book to > see how to improve performance and how relational databases work. > > Way back when, I used C J Date's book in school. > The text I used is way out of print, but here's a list of CJ Date's books > that can be found on Amazon: > http://www.amazon.com/s/qid=1230251379/ref=sr_pg_2?ie=UTF8&rs=1000&sort=date > rank&keywords=Relational%20Database&rh=i%3Astripbooks%2Ck%3ARelational%20Dat > abase%2Cp_27%3AC.%20J.%20Date&page=1 > > This may be a good start: > http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596 > 100124/ref=sr_1_6?ie=UTF8&s=books&qid=1230251480&sr=1-6 > > HTH > > -Mike > >> -----Original Message----- >> From: Sai Pullabhotla [mailto:[email protected]] >> Sent: Thursday, December 25, 2008 11:06 AM >> To: Derby Discussion >> Subject: Re: Need Advice on Indexes >> >> Mike, >> >> We went through all of these during the design phase and decided that >> we do not have to do the kind of normalization you have mentioned. >> Below are a couple of reasons: >> >> As stated before, my jobs table is nothing more than a plain LOG file >> which is stored in a database table instead of a regular file on the >> file system. The only reason why we needed to store it in the database >> is for easy search/filtering of records. If I were to use a plain log >> file, I don't think I would want to print a user ID (a number) to a >> log file which does not make much sense to the users reading it. I >> guess the same holds true regardless of the underlying data store >> (plain file or database table or something else). >> >> About not really deleting a user record and having a "status" field in >> the user table, we did go through this and here are the issues with >> this approach: >> Let us say my table definition is - >> CREATE TABLE user( >> user_id integer primary key, >> user_name char(20), >> full_name varchar(50)); >> >> We do want to have a unique constraint on the user_name field. If we >> do not purge the user record when a delete is requested, then I cannot >> have a new user with the same name as a user that was deleted 2 years >> ago. Same is true even if you create a composite unique key with >> user_name and status fields. Correct me if I'm wrong. >> >> You said that - >> >> > Based on this query, your index would never be used. >> >> Why not and how do I make Derby to use my index in various scenarios >> I've mentioned. >> >> Thanks. >> Sai Pullabhotla >> Phone: (402) 408-5753 >> Fax: (402) 408-6861 >> www.jMethods.com >> >> >> >> >> On Wed, Dec 24, 2008 at 4:10 PM, <[email protected]> wrote: >> > Sai, please see my comments in yours... >> > >> >> -----Original Message----- >> >> From: Sai Pullabhotla [mailto:[email protected]] >> >> Sent: Wednesday, December 24, 2008 1:54 PM >> >> To: Derby Discussion; [email protected] >> >> Subject: Re: Need Advice on Indexes >> >> >> >> Rick and Mike, >> >> >> >> Thanks for taking time to reply to my question. >> >> >> >> As far as normalizing the table that Mike suggested - >> >> >> >> this jobs table is nothing more than a history table. So, a project or >> >> user exists today but may not exist tomorrow. However, we want to >> >> maintain the history of every project that was executed (this is >> >> considered a job and hence the name). Since, a user may not exist in a >> >> future time, I can not maintain a foreign key to my user table even >> >> though I've a user table and an ID column in it. Same with projects. I >> >> hope it makes sense now why this table is a flat table. However, I >> >> think I can move the remarks column into its own table and use a >> >> longvarchar or clob field. >> >> >> > Ok, >> > That's not a good design. Your user information should be stored >> separately >> > and have an id associated with it. You don't want to purge this >> information; >> > however you can associate a status field as part of this table. So you >> > retain and don't reuse the ids and you can still keep historical data >> > intact. >> > >> > One of the problems you may have is how you populate the field with the >> > employee's name. What happens if you have two employees with the same >> name? >> > (Yes it happens. ;-) >> > The other problem is what happens when the person entering the name >> > misspells the name? (Segel != Segal,Siegel,Siegal,Seagull) :-) >> > >> > In addition, your table isn't normalized when you clearly have a 1 to >> many >> > relationship of data. Conceptually there are multiple jobs to a project. >> So >> > when you store jobs and projects in the same data, its not normalized. >> > >> > Note: There are times when you break from using 3rd normal form, however >> it >> > requires a lot of practical experience to justify it. >> > >> > >> >> Now, as far as the queries go, Most of the times, it is - >> >> >> >> select job_id, proj_id, proj_name, proj_folder, submit_user, >> >> run_user, queue_priority, run_priority, submit_time, start_time, >> >> end_time, time_taken, job_log, status from dpa_job where start_time>=? >> >> and start_time<? and status in(?, ?, ?, ?) order by start_time desc >> >> >> > Ok, >> > Not to nitpick, but I'm a bit old school and when writing a query I tend >> to >> > want to see it formatted to be easier to read. Taking your query... >> > >> > SELECT job_id, proj_id, proj_name, proj_folder, >> > submit_user, run_user, queue_priority, >> > run_priority, submit_time, start_time, end_time, >> > time_taken, job_log, status >> > FROM dpa_job >> > WHERE start_time>=? >> > AND start_time<? >> > AND status IN(?, ?, ?, ?) >> > ORDER BY start_time desc >> > >> > It's a bit easier to read. The key words are capitalized and you can see >> > your queries. >> > >> > Now looking at your common query example, the only thing you need to >> index >> > on is the start time column. You can add status to the query, however >> its >> > not going to really impact the performance of the query... unless you >> really >> > have a lot of rows. (Like more than 10,000 rows in the result set if you >> > ignore the status filter.) >> > >> > Does the IN clause effect the performance? There was an issue earlier in >> > Derby, but I think it has been fixed for a couple of versions. In your >> > example query with only 4 values, it is not that big of an issue. >> > >> > Based on this query, your index would never be used. >> > >> > You've indicated that this is a production database with customers using >> it. >> > I'm going to strongly suggest that your company hire a competent DBA. >> You've >> > said a couple of things that are major red flags. When I say DBA, I >> mean >> > someone who could be considered a logical DBA and has both data modeling >> and >> > app development expertise. If not a FTE, get a consultant in to do a >> health >> > check. >> > >> > >> > >> >> Sai Pullabhotla >> >> Phone: (402) 408-5753 >> >> Fax: (402) 408-6861 >> >> www.jMethods.com >> >> >> >> >> >> >> >> >> > >> > >> > >> > > > >
