Re: Native (GZIP) decompress not faster than builtin
Jens,

As your test shows, using a native codec won't make much sense for small files, since the involved JNI overhead will likely outweigh any possible gains. With all the performance improvements in Java 5 and 6, it's reasonable to ask whether the native implementation really does improve performance. I'd look at it as another option to further squeeze out some more performance if you really need to.

- Stefan

On Sun, May 10, 2009 at 11:03 AM, Jens Riboe jens.ri...@ribomation.com wrote:

Hi,

During the past week I decided to use native decompression for a Hadoop job (using 0.20.0). But before implementing it I decided to write a small benchmark, just to understand how much faster (better) it was. The result came as a surprise:

May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader clinit
INFO: Loaded the native-hadoop library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory clinit
INFO: Successfully loaded & initialized native-zlib library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor

Time of Hadoop decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
Time of Hadoop decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)

Hadoop vs. Vanilla [small]: 125.67%
Hadoop vs. Vanilla [large]: 99.80%

For a small file, Hadoop native decompression takes 25% longer to run than Java's built-in GZIPInputStream, and for a few-megabyte sized file the speed difference is negligible. I wrote a blog post about it which also contains the full source code of the benchmark: http://blog.ribomation.com/2009/05/07/comparison-of-decompress-ways-in-hadoop/

My questions are:
[1] Am I missing some key information for how to correctly use native GZIP decompression? I'm using codec pooling, by the way.
[2] Will native decompression only take off for files larger than 100MB or 1000MB? In my application I'm reading many KB-sized gz files from an external source, so I cannot change the compression method nor the file size.
[3] Has anybody experienced something similar to my result?

Kind regards /jens
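For reference, a minimal sketch of the two decompression paths being compared - the Hadoop codec with pooling versus plain GZIPInputStream - assuming the 0.20 API; the input path and buffer handling are placeholders, not part of Jens' actual benchmark:

import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;

public class GzipReadComparison {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/sample.gz");   // hypothetical input file

    // Path 1: Hadoop codec, backed by native zlib when the native library loads
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(file);   // resolved from the .gz suffix
    Decompressor decomp = CodecPool.getDecompressor(codec);
    try {
      InputStream in = codec.createInputStream(fs.open(file), decomp);
      drain(in);
      in.close();
    } finally {
      CodecPool.returnDecompressor(decomp);   // hand the decompressor back to the pool
    }

    // Path 2: plain java.util.zip
    InputStream in = new GZIPInputStream(fs.open(file));
    drain(in);
    in.close();
  }

  private static void drain(InputStream in) throws Exception {
    byte[] buf = new byte[64 * 1024];
    while (in.read(buf) != -1) { /* discard, we only measure decompression */ }
  }
}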
Re: large files vs many files
You just can't have many distributed jobs write into the same file without locking/synchronizing these writes, even with append(). It's no different than using a regular file from multiple processes in this respect. Maybe you need to collect your data up front, before processing it in Hadoop? Have a look at Chukwa, http://wiki.apache.org/hadoop/Chukwa

On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy sdo...@gmail.com wrote:

Would WritableFactories not allow me to open one output stream and continue to write() and sync()? Maybe I'm reading into that wrong. Although UUID would be nice, it would still leave me with the problem of having lots of little files instead of a few large files.
-sd

On Sat, May 9, 2009 at 8:37 AM, jason hadoop jason.had...@gmail.com wrote:

You must create unique file names. I don't believe (but I do not know) that the append code will allow multiple writers. Are you writing from within a task, or as an external application writing into Hadoop? You may try using UUID, http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of your filename. Without knowing more about your goals, environment and constraints it is hard to offer any more detailed suggestions. You could also have an application aggregate the streams and write out chunks, with one or more writers, one per output file.
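A minimal sketch of the unique-filename approach Jason suggests, assuming each writing process creates its own file under a shared directory; the directory and record content are made up for illustration:

import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path("/events/incoming");            // hypothetical target directory
    // Each writer gets its own path, so no two processes ever open the same file
    Path out = new Path(dir, "events-" + UUID.randomUUID() + ".log");
    FSDataOutputStream stream = fs.create(out, false);  // do not overwrite if it somehow exists
    stream.writeBytes("one record written by this process\n");
    stream.close();
  }
}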
Re: Hadoop / MySQL
If you have trouble loading your data into MySQL using INSERTs or LOAD DATA, consider that MySQL supports CSV directly using the CSV storage engine. The only thing you have to do is to copy your Hadoop-produced csv file into the mysql data directory and issue a flush tables command to have mysql flush its caches and pick up the new file. It's very simple and you have the full set of sql commands available just as with innodb or myisam. What you don't get with the csv engine are indexes and foreign keys. Can't have it all, can you?

Stefan

On Tue, Apr 28, 2009 at 9:23 PM, Bill Habermaas b...@habermaas.us wrote:

Excellent discussion. Thank you Todd. You're forgiven for being off topic (at least by me). :)
Bill

-----Original Message-----
From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: Tuesday, April 28, 2009 2:29 PM
To: core-user
Subject: Re: Hadoop / MySQL

Warning: derailing a bit into MySQL discussion below, but I think enough people have similar use cases that it's worth discussing this even though it's gotten off-topic.

2009/4/28 tim robertson timrobertson...@gmail.com:

So we ended up with 2 DBs - DB1 we insert to, prepare and do batch processing - DB2 serving the read-only web app

This is a pretty reasonable and common architecture. Depending on your specific setup, instead of flip-flopping between DB1 and DB2, you could actually pull snapshots of MyISAM tables off DB1 and load them onto other machines. As long as you've flushed the tables with a read lock, MyISAM tables are transferable between machines (eg via rsync). Obviously this can get a bit hairy, but it's a nice trick to consider for this kind of workflow.

Why did we end up with this? Because of locking on writes that kill reads as you say... basically you can't insert when a read is happening on myisam as it locks the whole table.

This is only true if you have binary logging enabled. Otherwise, myisam supports concurrent inserts with reads. That said, binary logging is necessary if you have any slaves. If you're loading bulk data from the result of a mapreduce job, you might be better off not using replication and simply loading the bulk data to each of the serving replicas individually. Turning off the binary logging will also double your write speed (LOAD DATA writes the entirety of the data to the binary log as well as to the table).

InnoDB has row level locking to get around this but in our experience (at the time we had 130 million records) it just didn't work either.

You're quite likely to be hitting the InnoDB autoincrement lock if you have an autoincrement primary key here. There are fixes for this in MySQL 5.1. The best solution is to avoid autoincrement primary keys and use LOAD DATA for these kinds of bulk loads, as others have suggested.

We spent €10,000 for the supposed European expert on mysql from their professional services and were unfortunately very disappointed. Seems such large tables are just problematic with mysql. We are now very much looking into Lucene technologies for search and Hadoop for reporting and datamining type operations. SOLR does a lot of what our DB does for us.

Yep - oftentimes MySQL is not the correct solution, but other times it can be just what you need. If you already have competencies with MySQL and a good access layer from your serving tier, it's often easier to stick with MySQL than add a new technology into the mix.

So with myisam...
here is what we learnt:

Only the very latest mysql versions (beta still, I think) support more than 4G memory for indexes (you really, really need the index in memory, and where possible the FK for joins in the index too).

As far as I know, any 64-bit mysql instance will use more than 4G without trouble.

Mysql has differing join strategies between innoDB and myisam, so be aware.

I don't think this is true. Joining happens at the MySQL execution layer, which is above the storage engine API. The same join strategies are available for both. For a particular query, InnoDB and MyISAM tables may end up providing a different query plan based on the statistics that are collected, but given properly analyzed tables, the strategies will be the same. This is how MySQL allows inter-storage-engine joins. If one engine is providing a better query plan, you can use query hints to enforce that plan (see STRAIGHT_JOIN and FORCE INDEX for example).

An undocumented feature of myisam is that you can create memory buffers for single indexes. In the my.cnf:

taxon_concept_cache.key_buffer_size=3990M  -- for some reason you have to drop a little under 4G

then in the DB run:

cache index taxon_concept in taxon_concept_cache;
load index into cache taxon_concept;

This allows for making sure an index gets into memory.

But for most use cases and a properly configured machine, you're better off letting it use its own caching policies to keep hot indexes in RAM.

And here
Re: hadoop job controller
You can get the job progress and completion status through an instance of org.apache.hadoop.mapred.JobClient. If you really want to use perl, I guess you still need to write a small java application that talks to perl on one side and JobClient on the other. There's also some support for Thrift in the hadoop contrib package, but I'm not sure if it exposes any job client related methods.

On Thu, Apr 2, 2009 at 12:46 AM, Elia Mazzawi elia.mazz...@casalemedia.com wrote:

I'm writing a perl program to submit jobs to the cluster, then wait for the jobs to finish, and check that they have completed successfully. I have some questions.

This shows what is running:
./hadoop job -list
and this shows the completion:
./hadoop job -status job_200903061521_0045

But I want something that just says pass / fail, because with these I have to check that it's done and then check that it's 100% completed. That must exist, since the webapp jobtracker.jsp knows what is what. Also, a controller like that must have been written many times already - are there any around?

Regards, Elia
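A hedged sketch of the small java application mentioned above, using JobClient and RunningJob from the old mapred API; the polling interval and the exit-code convention for the calling perl script are assumptions, not an existing tool:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusCheck {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    // args[0] is a job id such as job_200903061521_0045
    RunningJob job = client.getJob(JobID.forName(args[0]));
    if (job == null) {
      System.exit(2);                 // unknown job id
    }
    while (!job.isComplete()) {
      Thread.sleep(5000);             // poll the jobtracker until the job finishes
    }
    System.exit(job.isSuccessful() ? 0 : 1);  // 0 = pass, 1 = fail
  }
}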
Re: ANN: Hadoop UI beta
On Tue, Mar 31, 2009 at 1:23 PM, Mikhail Yakshin greycat.na@gmail.com wrote:

Could you please explain what it does, or at least what you want it to do? Why is it better than the default Hadoop web UI?
Mikhail

We needed a full-featured hdfs file manager for end-users that could be distributed over the web. It's something we haven't found out there (webdav or fuse not being an option for us) and may also be useful for other hadoop users. It's an offspring of a commercial platform we're about to develop, along with the job tracker for internal use and other modules, related and unrelated to hadoop.

@W: Thanks :) I haven't tried it with trunk. I'm going to create a custom branch for 0.18 and trunk when there's time for it.

@vishal: Job history is not implemented yet. I haven't yet figured out exactly how to do this. It's on the list.

- Stefan
Re: ANN: Hadoop UI beta
Hi Brian,

On Tue, Mar 31, 2009 at 3:46 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

Hey Stefan, I like it. I would like to hear a bit about how the security policies work. If I open this up to the world, how does the world authenticate/authorize with my cluster?

Not at all. The daemon part of Hadoop UI is running under a configurable user and will issue calls to Hadoop on behalf of this user. It's not much different from the standard web UI in this domain. The plan is to introduce an authentication layer with one of the next releases. It will be based on Spring Security and thus enable you to use many different authentication providers. So downloading all those Spring libraries along with the project will finally pay off ;)

I'd love nothing more than to be able to give my users a dead-simple way to move files on and off the cluster. This appears to be a step in the right direction. I'm not familiar with Adobe Flex -- how will this affect others' ability to use it (i.e., Linux/Mac folks?) and how will this affect the ability to contribute (i.e., if you get a new job, are the users of this project screwed?). Gah, I sound like my boss.

There's nothing arcane about Flex, but please don't tell anybody. You can get the recently open-sourced (MPL) SDK for any platform supporting Java and compile Hadoop UI using ant. Other libraries used are flexlib (MIT license), Spring (Apache License), BlazeDS (LGPL). In case I would have to look for a new job, as you suggest, other people would be able to fork as long as they know some ActionScript, the actual language used in Flex, and some XML.

- Stefan
Re: Join Variation
Have you considered hbase for this particular task? Looks like a simple lookup using the network mask as key would solve your problem. It's also possible to derive the network class (A, B, C) from the concerned ip. But I guess your search file will cover ranges in more detail than just on class level.

On Tue, Mar 24, 2009 at 12:33 PM, Tamir Kamara tamirkam...@gmail.com wrote:

Hi,

We need to implement a join with a between operator instead of an equal. What we are trying to do is search a file for a key where the key falls between two fields in the search file, like this:

main file (ip, a, b):
(80, zz, yy)
(125, vv, bb)

search file (from-ip, to-ip, d, e):
(52, 75, xxx, yyy)
(78, 98, aaa, bbb)
(99, 115, xxx, ddd)
(125, 130, hhh, aaa)
(150, 162, qqq, sss)

the outcome should be in the form (ip, a, b, d, e):
(80, zz, yy, aaa, bbb)
(125, vv, bb, eee, hhh)

We could convert the ip ranges in the search file to single-record ips and then do a regular join, but the number of single ips is huge and this is probably not a good way. What would be a good course for doing this in hadoop?

Thanks,
Tamir
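If the search file is small enough to cache on every mapper, an in-memory range lookup is one alternative to the HBase idea. A sketch, assuming non-overlapping ranges; in a real job the map would be built in the mapper's configure()/setup() from a distributed-cache copy of the search file:

import java.util.Map;
import java.util.TreeMap;

public class RangeLookup {
  static class Range {
    final long from, to;
    final String d, e;
    Range(long from, long to, String d, String e) {
      this.from = from; this.to = to; this.d = d; this.e = e;
    }
  }

  // keyed by from-ip, so floorEntry() finds the candidate range for any ip
  private final TreeMap<Long, Range> ranges = new TreeMap<Long, Range>();

  void add(long from, long to, String d, String e) {
    ranges.put(from, new Range(from, to, d, e));
  }

  Range lookup(long ip) {
    Map.Entry<Long, Range> entry = ranges.floorEntry(ip); // greatest from-ip <= ip
    if (entry != null && ip <= entry.getValue().to) {
      return entry.getValue();
    }
    return null; // ip falls into a gap between ranges
  }

  public static void main(String[] args) {
    RangeLookup lookup = new RangeLookup();
    lookup.add(78, 98, "aaa", "bbb");
    lookup.add(125, 130, "hhh", "aaa");
    Range r = lookup.lookup(80);              // matches the (78, 98) range
    System.out.println(r.d + ", " + r.e);
  }
}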
csv input format handling and mapping
Hi,

Can anyone share his experience or a solution for the following problem? I'm having to deal with a lot of different file formats, most of them csv. Each of them shares similar semantics, i.e. fields in file A exist in file B as well. What I'm not sure of is the exact index of the field in the csv file. Fields in file A may also have different names for the same thing as in file B.

Simplified example:

affiliateA.csv:
Date; Clicks; Views; Orders
2009-03-10; 10; 20; 4

affiliateB.csv:
Date; Orders; Impressions; Clicks
13/03/09; 40; 2000; 1000

Possible mapping file:

<field-mapping>
  <field id="date" type="java.util.Date"/>
  <field id="clicks" type="java.lang.Integer"/>
  <field id="views" type="java.lang.Integer"/>
  <file path-pattern="/affiliateA/*.csv">
    <format type="csv">
      <seperator>\t</seperator>
      <quotes>&quot;</quotes>
    </format>
    <columns>
      <column name="Date" alias="date">
        <format>yyyy-MM-DD</format>
      </column>
      <column name="Clicks" alias="clicks"/>
    </columns>
  </file>
  <file path-pattern="/affiliateB/*.csv">
    <format type="csv">
      <seperator>;</seperator>
    </format>
    <columns>
      <column index="1" alias="date">
        <format>dd/MM/yy</format>
      </column>
      <column index="2" alias="clicks"/>
    </columns>
  </file>
</field-mapping>

What I'd like to be able to do is use this external descriptor for each file with a custom hadoop InputFormat. Instead of a line of text, my MR values would be a Map containing the parsed values mapped to the field IDs:

map(key, fields) {
  Date date = fields.get("date");
  Integer clicks = fields.get("clicks");
}

This would allow me to decouple my MR job from the actual file format and also move all csv handling code out of my mappers. Does anyone know if such a solution already exists for hadoop? Any thoughts?

Stefan
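As a starting point, a minimal sketch of just the parsing core such a custom reader could delegate to, with the column-to-field mapping hard-coded here instead of read from the XML descriptor; the InputFormat/RecordReader plumbing and descriptor parsing are left out:

import java.util.HashMap;
import java.util.Map;

public class CsvFieldMapper {
  private final String separator;
  private final Map<Integer, String> columnToField; // csv column index -> shared field id

  CsvFieldMapper(String separator, Map<Integer, String> columnToField) {
    this.separator = separator;
    this.columnToField = columnToField;
  }

  // Turns one csv line into a map keyed by the descriptor's field ids
  Map<String, String> parse(String line) {
    String[] cells = line.split(separator);
    Map<String, String> fields = new HashMap<String, String>();
    for (Map.Entry<Integer, String> e : columnToField.entrySet()) {
      if (e.getKey() < cells.length) {
        fields.put(e.getValue(), cells[e.getKey()].trim());
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    // illustrative 0-based mapping for the affiliateB layout above
    Map<Integer, String> mapping = new HashMap<Integer, String>();
    mapping.put(0, "date");    // Date is the first column
    mapping.put(3, "clicks");  // Clicks is the fourth column
    CsvFieldMapper mapper = new CsvFieldMapper(";", mapping);
    Map<String, String> fields = mapper.parse("13/03/09; 40; 2000; 1000");
    System.out.println(fields.get("date") + " / " + fields.get("clicks"));
  }
}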
Re: Backing up HDFS?
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote:

The key here is to prioritize your data. Impossible-to-replicate data gets backed up using whatever means necessary; hard-to-regenerate data, next priority. Easy-to-regenerate and ok-to-nuke data doesn't get backed up.

I think that's good advice to start with when creating a backup strategy. E.g. what we do at the moment is analyze huge volumes of access logs, where we import those logs into hdfs, create aggregates for several metrics and finally store the results in sequence files using block level compression. It's kind of an intermediate format that can be used for further analysis. Those files end up being pretty small and will be exported daily to storage and backed up. In case hdfs goes to hell we can restore some raw log data from the servers and only lose historical logs, which should not be a big deal.

I must also add that I really enjoy the great deal of optimization opportunities that hadoop gives you by directly implementing the serialization strategies. You really get control over every bit and byte that gets recorded. Same with compression. So you can make the best trade-offs possible and finally store only the data you really need.
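A minimal sketch of writing such block-compressed aggregates to a SequenceFile; the output path, key/value types, metric names and codec are placeholders, not the actual export format described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class AggregateExport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/exports/2009-02-10/metrics.seq");  // hypothetical export path
    // BLOCK compression compresses batches of records together, which keeps
    // the files small for repetitive aggregate data
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, IntWritable.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());
    writer.append(new Text("pageviews"), new IntWritable(123456));
    writer.append(new Text("visits"), new IntWritable(7890));
    writer.close();
  }
}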
Re: How to use DBInputFormat?
On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg fred...@avafan.com wrote:

Well, that obviously depends on the RDBMS' implementation. And although the case is not as bad as you describe (otherwise you better ask your RDBMS vendor for your money back), your point is valid. But then again, an RDBMS is not designed for that kind of work.

Right. Clash of design paradigms. Hey MySQL, I want my money back!! Oh, wait..

Another scenario I just recognized: what about current/realtime data? E.g. 'select * from logs where date = today()'. Working with 'offset' may turn out to return different results after the table has been updated while tasks are still pending. Pretty ugly to trace down this condition, after you find out that sometimes your results are just not right..

What do you mean by creating splits/map tasks on the fly dynamically?
Fredrik

On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote:

As far as I understand, the main problem is that you need to create splits from streaming data with an unknown number of records and offsets. It's just the same problem as with externally compressed data (.gz). You need to go through the complete stream (or do a table scan) to create logical splits. Afterwards each map task needs to seek to the appropriate offset on a new stream all over again. Very expensive. As with compressed files, no wonder only one map task is started for each .gz file and will consume the complete file. IMHO the DBInputFormat should follow this behavior and just create 1 split whatsoever. Maybe a future version of hadoop will allow creating splits/map tasks on the fly dynamically?

Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg fred...@avafan.com wrote:

Indeed sir. The implementation was designed like you describe for two reasons: first and foremost to make it as simple as possible for the user to use a JDBC database as input and output for Hadoop, secondly because of the specific requirements the MapReduce framework brings to the table (split distribution, split reproducibility etc).

This design will, as you note, never handle the same amount of data as HBase (or HDFS), and was never intended to. That being said, there are a couple of ways that the current design could be augmented to perform better (and, as in its current form, tweaked, depending on your data and computational requirements). Shard awareness is one way, which would let each database/tasktracker node execute mappers on data where each split is a single database server, for example.

If you have any ideas on how the current design can be improved, please do share.

Fredrik

On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only few datasets. That's due to the fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge performance impact on large tables)
2) creating splits by issuing an individual query for each split, with a limit and offset parameter appended to the input sql query

Effectively your input query 'select * from orders' would become 'select * from orders limit splitlength offset splitstart' and be executed until count has been reached. I guess this is not working sql syntax for oracle.
Stefan

2009/2/4 Amandeep Khurana ama...@gmail.com:

Adding a semicolon gives me the error "ORA-00911: Invalid character".

Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

Amandeep, "SQL command not properly ended" - I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try.
Rasit

2009/2/4 Amandeep Khurana ama...@gmail.com:

The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side.

Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote:

Thanks Kevin. I couldn't get it to work. Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended
    at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321
Re: How to use DBInputFormat?
The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only few datasets. That's due to the fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge performance impact on large tables)
2) creating splits by issuing an individual query for each split, with a limit and offset parameter appended to the input sql query

Effectively your input query 'select * from orders' would become 'select * from orders limit splitlength offset splitstart' and be executed until count has been reached. I guess this is not working sql syntax for oracle.

Stefan

2009/2/4 Amandeep Khurana ama...@gmail.com:

Adding a semicolon gives me the error "ORA-00911: Invalid character".

Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

Amandeep, "SQL command not properly ended" - I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try.
Rasit

2009/2/4 Amandeep Khurana ama...@gmail.com:

The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side.

Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote:

Thanks Kevin. I couldn't get it to work. Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended
    at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at LoadTable1.run(LoadTable1.java:130)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at LoadTable1.main(LoadTable1.java:107)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Exception closing file /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
    at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
    at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
    at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
    at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
    at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
    at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
    at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)

Here's my code:

public class LoadTable1 extends Configured implements Tool {
  // data destination on hdfs
  private static final String CONTRACT_OUTPUT_PATH = "contract_table";
  // The JDBC connection URL and driver implementation class
  private static final String CONNECT_URL = "jdbc:oracle:thin:@dbhost:1521:PSEDEV";
  private static final String DB_USER = "user";
  private static final String DB_PWD = "pass";
  private static final String DATABASE_DRIVER_CLASS = "oracle.jdbc.driver.OracleDriver";
  private static final String CONTRACT_INPUT_TABLE =
Re: How to use DBInputFormat?
As far as I understand, the main problem is that you need to create splits from streaming data with an unknown number of records and offsets. It's just the same problem as with externally compressed data (.gz). You need to go through the complete stream (or do a table scan) to create logical splits. Afterwards each map task needs to seek to the appropriate offset on a new stream all over again. Very expensive. As with compressed files, no wonder only one map task is started for each .gz file and will consume the complete file. IMHO the DBInputFormat should follow this behavior and just create 1 split whatsoever. Maybe a future version of hadoop will allow creating splits/map tasks on the fly dynamically?

Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg fred...@avafan.com wrote:

Indeed sir. The implementation was designed like you describe for two reasons: first and foremost to make it as simple as possible for the user to use a JDBC database as input and output for Hadoop, secondly because of the specific requirements the MapReduce framework brings to the table (split distribution, split reproducibility etc).

This design will, as you note, never handle the same amount of data as HBase (or HDFS), and was never intended to. That being said, there are a couple of ways that the current design could be augmented to perform better (and, as in its current form, tweaked, depending on your data and computational requirements). Shard awareness is one way, which would let each database/tasktracker node execute mappers on data where each split is a single database server, for example.

If you have any ideas on how the current design can be improved, please do share.

Fredrik

On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only few datasets. That's due to the fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge performance impact on large tables)
2) creating splits by issuing an individual query for each split, with a limit and offset parameter appended to the input sql query

Effectively your input query 'select * from orders' would become 'select * from orders limit splitlength offset splitstart' and be executed until count has been reached. I guess this is not working sql syntax for oracle.

Stefan

2009/2/4 Amandeep Khurana ama...@gmail.com:

Adding a semicolon gives me the error "ORA-00911: Invalid character".

Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

Amandeep, "SQL command not properly ended" - I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try.
Rasit

2009/2/4 Amandeep Khurana ama...@gmail.com:

The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side.

Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote:

Thanks Kevin. I couldn't get it to work.
Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended
    at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at LoadTable1.run(LoadTable1.java:130)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at LoadTable1.main(LoadTable1.java:107)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65
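For completeness, a hedged sketch of how DBInputFormat is typically wired up with the 0.19/0.20 mapred API discussed in this thread; the table, columns and connection settings are placeholders, and the limit/offset splitting described above applies to whatever query this configuration generates:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class OrdersJobSetup {
  // One record of the hypothetical 'orders' table, readable from a ResultSet
  public static class OrderRecord implements Writable, DBWritable {
    long id;
    String customer;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      customer = rs.getString("customer");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, customer);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      customer = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(customer);
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/shop", "user", "pass");
    // table name, optional WHERE conditions, ORDER BY column, field list
    DBInputFormat.setInput(job, OrderRecord.class,
        "orders", null, "id", "id", "customer");
  }
}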