Re: Native (GZIP) decompress not faster than builtin

2009-05-10 Thread Stefan Podkowinski
Jens,

As your test shows, using a native codec won't make much sense for
small files, since the JNI overhead involved will likely outweigh
any possible gains. With all the performance improvements in Java 5
and 6, it's reasonable to ask whether the native implementation really
improves performance. I'd look at it as another option to further
squeeze out some more performance if you really need to.
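
For illustration, a rough sketch of the two decompression paths being
compared (file path and buffer size are made up; the Hadoop codec uses the
native zlib library transparently when it has been loaded, and CodecPool
gives you the pooled decompressor Jens mentions):

import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class DecompressComparison {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);   // some .gz file

    // (a) Hadoop codec, backed by native zlib if the native library is loaded
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    Decompressor decompressor = CodecPool.getDecompressor(codec);
    InputStream hadoopIn = codec.createInputStream(fs.open(file), decompressor);
    drain(hadoopIn);
    CodecPool.returnDecompressor(decompressor);

    // (b) plain JDK decompression via java.util.zip
    InputStream vanillaIn = new GZIPInputStream(fs.open(file));
    drain(vanillaIn);
  }

  private static void drain(InputStream in) throws Exception {
    byte[] buf = new byte[64 * 1024];
    while (in.read(buf) != -1) { /* discard */ }
    in.close();
  }
}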

- Stefan

On Sun, May 10, 2009 at 11:03 AM, Jens Riboe jens.ri...@ribomation.com wrote:
 Hi,

 During the past week I decided to use native decompression for a Hadoop job
 (using 0.20.0). But before implementing it I decided to write a small
 benchmark just to understand how much faster (better) it was. The result
 came as a surprise.

 May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
 INFO: Loaded the native-hadoop library
 May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory <clinit>
 INFO: Successfully loaded & initialized native-zlib library
 May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
 INFO: Got brand-new decompressor
 Time of Hadoop  decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
 Time of Hadoop  decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
 Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
 Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
 Hadoop vs. Vanilla [small]: 125.67%
 Hadoop vs. Vanilla [large]: 99.80%

 For a small file, Hadoop native decompression takes 25% longer to run than
 Java's built-in GZIPInputStream, and for a file a few megabytes in size the
 speed difference is negligible.

 I wrote a blog post about it which also contains the full source code of the
 benchmark.
 http://blog.ribomation.com/2009/05/07/comparison-of-decompress-ways-in-hadoop/

 My questions are:
 [1]  Am I missing some key information about how to correctly use the native
 GZIP codec?
        I'm using codec pooling, by the way.

 [2]  Will native decompression only pay off for files larger than 100MB or
 1000MB?
        In my application I'm reading many KB-sized gz files from an
        external source, so I cannot change the compression method nor the
        file size.

 [3]  Has anybody experienced something similar to my result?


 Kind regards /jens




Re: large files vs many files

2009-05-10 Thread Stefan Podkowinski
You just can't have many distributed jobs write into the same file
without locking/synchronizing these writes, even with append(). It's
no different from using a regular file from multiple processes in
this respect.
Maybe you need to collect your data up front before processing it in Hadoop?
Have a look at Chukwa, http://wiki.apache.org/hadoop/Chukwa


On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy sdo...@gmail.com wrote:
 Would WritableFactories not allow me to open one outputstream and continue
 to write() and sync() ?

 Maybe I'm reading into that wrong.  Although UUID would be nice, it would
 still leave me with the problem of having lots of little files instead of a
 few large files.

 -sd

 On Sat, May 9, 2009 at 8:37 AM, jason hadoop jason.had...@gmail.com wrote:

 You must create unique file names; I don't believe (but I do not know) that
 the append code will allow multiple writers.

 Are you writing from within a task, or as an external application writing
 into Hadoop?

 You may try using UUID,
 http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of
 your
 filename.
 Without knowing more about your goals, environment and constraints it is
 hard to offer any more detailed suggestions.
 You could also have an application aggregate the streams and write out
 chunks, with one or more writers, one per output file.
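
A minimal sketch of the UUID-per-writer idea, assuming an external client
writing into HDFS (the path and payload below are made up):

import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueFileWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // each writer picks its own file name, so writers never share a file
    Path out = new Path("/incoming/events-" + UUID.randomUUID() + ".log");
    FSDataOutputStream stream = fs.create(out, false); // fail if it already exists
    stream.writeBytes("some event data\n");
    stream.close();
  }
}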





Re: Hadoop / MySQL

2009-04-29 Thread Stefan Podkowinski
If you have trouble loading your data into MySQL using INSERTs or LOAD
DATA, consider that MySQL supports CSV directly using the CSV storage
engine. The only thing you have to do is copy your Hadoop-produced
CSV file into the MySQL data directory and issue a FLUSH TABLES
command to have MySQL flush its caches and pick up the new file. It's
very simple, and you have the full set of SQL commands available just
as with InnoDB or MyISAM. What you don't get with the CSV engine are
indexes and foreign keys. Can't have it all, can you?
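
As a rough illustration of that workflow via JDBC (table name, columns and
file locations are made up; the actual file copy into the MySQL data
directory happens outside SQL, e.g. with cp or rsync):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvEngineRefresh {
  public static void main(String[] args) throws Exception {
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost/stats", "user", "pass");
    Statement st = con.createStatement();
    // CSV engine tables cannot have indexes and all columns must be NOT NULL
    st.execute("CREATE TABLE IF NOT EXISTS daily_metrics ("
        + " day DATE NOT NULL, clicks INT NOT NULL, views INT NOT NULL"
        + ") ENGINE=CSV");
    // ... copy the Hadoop-produced CSV over daily_metrics.CSV in the
    // MySQL data directory at this point ...
    st.execute("FLUSH TABLES");  // let MySQL drop its caches and pick up the new file
    st.close();
    con.close();
  }
}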

Stefan


On Tue, Apr 28, 2009 at 9:23 PM, Bill Habermaas b...@habermaas.us wrote:
 Excellent discussion. Thank you Todd.
 You're forgiven for being off topic (at least by me).
 :)
 Bill

 -Original Message-
 From: Todd Lipcon [mailto:t...@cloudera.com]
 Sent: Tuesday, April 28, 2009 2:29 PM
 To: core-user
 Subject: Re: Hadoop / MySQL

 Warning: derailing a bit into MySQL discussion below, but I think enough
 people have similar use cases that it's worth discussing this even though
 it's gotten off-topic.

 2009/4/28 tim robertson timrobertson...@gmail.com


 So we ended up with 2 DBs
 - DB1 we insert to, prepare and do batch processing
 - DB2 serving the read only web app


 This is a pretty reasonable and common architecture. Depending on your
 specific setup, instead of flip-flopping between DB1 and DB2, you could
 actually pull snapshots of MyISAM tables off DB1 and load them onto other
 machines. As long as you've flushed the tables with a read lock, MyISAM
 tables are transferable between machines (e.g. via rsync). Obviously this can
 get a bit hairy, but it's a nice trick to consider for this kind of
 workflow.


 Why did we end up with this?  Because of locking on writes that kill
 reads, as you say... basically you can't insert while a read is
 happening on MyISAM, as it locks the whole table.


 This is only true if you have binary logging enabled. Otherwise, MyISAM
 supports concurrent inserts with reads. That said, binary logging is
 necessary if you have any slaves. If you're loading bulk data from the
 result of a MapReduce job, you might be better off not using replication and
 simply loading the bulk data to each of the serving replicas individually.
 Turning off the binary logging will also double your write speed (LOAD DATA
 writes the entirety of the data to the binary log as well as to the table).


  InnoDB has row-level
 locking to get around this, but in our experience (at the time we had
 130 million records) it just didn't work either.


 You're quite likely to be hitting the InnoDB autoincrement lock if you have
 an autoincrement primary key here. There are fixes for this in MySQL 5.1.
 The best solution is to avoid autoincrement primary keys and use LOAD DATA
 for these kinds of bulk loads, as others have suggested.
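
A minimal sketch of that kind of bulk load through JDBC (table, file name
and connection details are made up; the CSV is whatever the MapReduce job
produced):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost/stats?allowLoadLocalInfile=true", "user", "pass");
    Statement st = con.createStatement();
    // the target table uses an explicit key column rather than AUTO_INCREMENT
    st.execute("LOAD DATA LOCAL INFILE '/tmp/part-00000.csv'"
        + " INTO TABLE daily_metrics"
        + " FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
    st.close();
    con.close();
  }
}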


  We spent €10,000 for
 the supposed European expert on MySQL from their professional
 services and were unfortunately very disappointed.  It seems such large
 tables are just problematic with MySQL.  We are now very much looking
 into Lucene technologies for search and Hadoop for reporting and
 data-mining-type operations. SOLR does a lot of what our DB does for
 us.


 Yep - oftentimes MySQL is not the correct solution, but other times it can
 be just what you need. If you already have competencies with MySQL and a
 good access layer from your serving tier, it's often easier to stick with
 MySQL than add a new technology into the mix.



 So with MyISAM... here is what we learnt:

 Only the very latest MySQL versions (still beta, I think) support more than
 4G of memory for indexes (you really, really need the index in memory, and
 where possible the FK for joins in the index too).


 As far as I know, any 64-bit MySQL instance will use more than 4G without
 trouble.


  MySQL has
 differing join strategies between InnoDB and MyISAM, so be aware.


 I don't think this is true. Joining happens at the MySQL execution layer,
 which is above the storage engine API. The same join strategies are
 available for both. For a particular query, InnoDB and MyISAM tables may end
 up providing a different query plan based on the statistics that are
 collected, but given properly analyzed tables, the strategies will be the
 same. This is how MySQL allows inter-storage-engine joins. If one engine is
 providing a better query plan, you can use query hints to enforce that plan
 (see STRAIGHT_JOIN and FORCE INDEX, for example).


 An undocumented feature of MyISAM is that you can create memory buffers for
 single indexes.
 In the my.cnf:
     taxon_concept_cache.key_buffer_size=3990M    -- for some reason you have to drop a little under 4G

 then in the DB run:
    cache index taxon_concept in taxon_concept_cache;
    load index into cache taxon_concept;

 This lets you make sure an index definitely ends up in memory.


 But for most use cases and a properly configured machine you're better off
 letting it use its own caching policies to keep hot indexes in RAM.



 And here 

Re: hadoop job controller

2009-04-02 Thread Stefan Podkowinski
You can get the job progress and completion status through an instance
of org.apache.hadoop.mapred.JobClient. If you really want to use Perl,
I guess you still need to write a small Java application that talks to
Perl on one side and JobClient on the other.
There's also some support for Thrift in the Hadoop contrib package, but
I'm not sure if it exposes any job-client-related methods.
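
For example, a small wrapper along these lines (a sketch, assuming the
cluster configuration is on the classpath) could be shelled out to from Perl
and its single output line parsed:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusCheck {
  public static void main(String[] args) throws Exception {
    // e.g. args[0] = "job_200903061521_0045"
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName(args[0]));
    if (job == null) {
      System.out.println("UNKNOWN");
    } else if (!job.isComplete()) {
      System.out.println("RUNNING map=" + (int) (job.mapProgress() * 100)
          + "% reduce=" + (int) (job.reduceProgress() * 100) + "%");
    } else {
      System.out.println(job.isSuccessful() ? "PASS" : "FAIL");
    }
  }
}

The Perl side would then only need to match PASS/FAIL instead of combining
"job -list" and "job -status" output.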

On Thu, Apr 2, 2009 at 12:46 AM, Elia Mazzawi
elia.mazz...@casalemedia.com wrote:

 I'm writing a perl program to submit jobs to the cluster,
 then wait for the jobs to finish, and check that they have completed
 successfully.

 I have some questions,

 this shows what is running
 ./hadoop job  -list

 and this shows the completion
 ./hadoop job -status  job_200903061521_0045


 but I want something that just says pass/fail,
 because with these I have to check that it's done and then check that it's
 100% completed.

 which must exist since the webapp jobtracker.jsp knows what is what.

 Also, a controller like that must have been written many times already; are
 there any around?

 Regards,
 Elia



Re: ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
On Tue, Mar 31, 2009 at 1:23 PM, Mikhail Yakshin
greycat.na@gmail.com wrote:

 Could you please explain what it does, or at least what you want it to do?
 Why is it better than the default Hadoop web UI?


Mikhail, we needed a full-featured HDFS file manager for end users
that could be distributed over the web. It's something we haven't found
out there (WebDAV or FUSE not being an option for us) and it may also be
useful for other Hadoop users. It's an offspring of a commercial
platform we're about to develop, along with the job tracker for
internal use and other modules, related and unrelated to Hadoop.

@W:
Thanks :) I haven't tried it with trunk. I'm going to create a custom
branch for 0.18 and trunk when there's time for it.

@vishal:
Job history is not implemented yet. I haven't yet figured out exactly
how to do it. It's on the list.

- Stefan


Re: ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
Hi Brian

On Tue, Mar 31, 2009 at 3:46 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
 Hey Stefan,

 I like it.  I would like to hear a bit about how the security policies work.
 If I open this up to the world, how does the world authenticate/authorize
 with my cluster?

Not at all. The daemon part of Hadoop UI runs under a
configurable user and will issue calls to Hadoop on behalf of that
user. It's not much different from the standard web UI in this respect.
The plan is to introduce an authentication layer in one of the next
releases. It will be based on Spring Security and thus will enable you to
use many different authentication providers. So downloading all those
Spring libraries along with the project will finally pay off ;)

 I'd love nothing more than to be able to give my users a dead-simple way to
 move files on and off the cluster.  This appears to be a step in the right
 direction.

 I'm not familiar with Adobe Flex -- how will this affect others' ability
 to use it (i.e., Linux & Mac folks?) and how will this affect the ability to
 contribute (i.e., if you get a new job, are the users of this project
 screwed?).  Gah, I sound like my boss.

There's nothing arcane about Flex, but please don't tell anybody. You
can get the recently open-sourced (MPL) SDK for any platform
supporting Java and compile Hadoop UI using Ant. Other libraries used
are flexlib (MIT license), Spring (Apache License) and BlazeDS (LGPL). In
case I have to look for a new job, as you suggest, other people
would be able to fork it as long as they know some ActionScript, the
actual language used in Flex, and some XML.

- Stefan


Re: Join Variation

2009-03-24 Thread Stefan Podkowinski
Have you considered HBase for this particular task?
It looks like a simple lookup using the network mask as key would solve
your problem.

It's also possible to derive the network class (A, B, C) from the
IP address concerned. But I guess your search file will
cover ranges in more detail than just at class level.
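
If the search file is small enough to keep in memory, another option
(without HBase) is a map-side lookup: load the ranges into a sorted map in
each mapper and do a floor lookup per key. A rough sketch using Java 6's
NavigableMap, with made-up field handling and assuming the ranges do not
overlap:

import java.util.Map;
import java.util.TreeMap;

public class IpRangeLookup {
  // from-ip -> { to-ip, d, e }, e.g. 78 -> { "98", "aaa", "bbb" }
  private final TreeMap<Long, String[]> ranges = new TreeMap<Long, String[]>();

  public void addRange(long fromIp, long toIp, String d, String e) {
    ranges.put(fromIp, new String[] { String.valueOf(toIp), d, e });
  }

  /** Returns { d, e } for the range containing ip, or null if no range matches. */
  public String[] lookup(long ip) {
    Map.Entry<Long, String[]> entry = ranges.floorEntry(ip);  // greatest from-ip <= ip
    if (entry == null) return null;
    String[] v = entry.getValue();
    return ip <= Long.parseLong(v[0]) ? new String[] { v[1], v[2] } : null;
  }
}

With the sample data below, lookup(80) would return { aaa, bbb } and
lookup(125) would return { hhh, aaa }.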

On Tue, Mar 24, 2009 at 12:33 PM, Tamir Kamara tamirkam...@gmail.com wrote:
 Hi,

 We need to implement a join with a 'between' operator instead of an equality.
 What we are trying to do is search a file for a key where the key falls
 between two fields in the search file, like this:

 main file (ip, a, b):
 (80, zz, yy)
 (125, vv, bb)

 search file (from-ip, to-ip, d, e):
 (52, 75, xxx, yyy)
 (78, 98, aaa, bbb)
 (99, 115, xxx, ddd)
 (125, 130, hhh, aaa)
 (150, 162, qqq, sss)

 the outcome should be in the form (ip, a, b, d, e):
 (80, zz, yy, aaa, bbb)
 (125, vv, bb, hhh, aaa)

 We could convert the IP ranges in the search file to single-record IPs and
 then do a regular join, but the number of single IPs is huge, so this is
 probably not a good way.
 What would be a good course for doing this in Hadoop?


 Thanks,
 Tamir



csv input format handling and mapping

2009-03-13 Thread Stefan Podkowinski
Hi

Can anyone share their experience or a solution for the following problem?
I have to deal with a lot of different file formats, most of them CSV.
Each of them shares similar semantics, i.e. fields in file A exist in
file B as well.
What I'm not sure of is the exact index of each field in the CSV file.
Fields in file A may also have different names than the corresponding fields in file B.

Simplified example:

affiliateA.csv:
Date; Clicks; Views; Orders
2009-03-10; 10; 20; 4

affiliateB.csv
Date; Orders; Impressions; Clicks
13/03/09; 40; 2000; 1000


Possible mapping file:

<field-mapping>
  <field id="date" type="java.util.Date"/>
  <field id="clicks" type="java.lang.Integer"/>
  <field id="views" type="java.lang.Integer"/>

  <file>
     <path pattern="/affiliateA/*.csv"/>
     <format type="csv">
        <seperator>\t</seperator>
        <quotes>&quot;</quotes>
     </format>
     <columns>
        <column name="Date" alias="date">
           <format>yyyy-MM-DD</format>
        </column>
        <column name="Clicks" alias="clicks"/>
     </columns>
  </file>

  <file>
     <path pattern="/affiliateB/*.csv"/>
     <format type="csv">
        <seperator>;</seperator>
     </format>
     <columns>
        <column index="1" alias="date">
           <format>dd/MM/yyy</format>
        </column>
        <column index="2" alias="clicks"/>
     </columns>
  </file>
</field-mapping>


What I'd like to be able to do is use this external descriptor for each
file with a custom Hadoop InputFormat.
Instead of a line of text, my MR values would be a Map containing the
parsed values, keyed by the field IDs.

map(key, fields) {
  Date date = fields.get('date');
  Integer clicks = fields.get('clicks');
}

This would allow me to decouple my MR job from the actual file format
and also move all CSV handling code out of my mappers.
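
To make the idea a bit more concrete, here is a rough sketch of the
RecordReader half of such an InputFormat (old mapred API; the descriptor
parsing that yields the separator and per-column aliases is left out, and
type conversion is skipped, so all values end up as Text):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

/** Turns each CSV line into a MapWritable keyed by the field aliases
    resolved from the external mapping descriptor. */
public class CsvRecordReader implements RecordReader<LongWritable, MapWritable> {
  private final LineRecordReader lines;
  private final Text line = new Text();
  private final String separator;
  private final String[] aliases;  // alias per column index, from the descriptor

  public CsvRecordReader(LineRecordReader lines, String separator, String[] aliases) {
    this.lines = lines;
    this.separator = separator;
    this.aliases = aliases;
  }

  public boolean next(LongWritable key, MapWritable fields) throws IOException {
    if (!lines.next(key, line)) return false;
    fields.clear();
    String[] cols = line.toString().split(separator);
    for (int i = 0; i < cols.length && i < aliases.length; i++) {
      fields.put(new Text(aliases[i]), new Text(cols[i].trim()));
    }
    return true;
  }

  public LongWritable createKey() { return lines.createKey(); }
  public MapWritable createValue() { return new MapWritable(); }
  public long getPos() throws IOException { return lines.getPos(); }
  public float getProgress() throws IOException { return lines.getProgress(); }
  public void close() throws IOException { lines.close(); }
}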

Does anyone know if such a solution already exists for hadoop? Any thoughts?

Stefan


Re: Backing up HDFS?

2009-02-12 Thread Stefan Podkowinski
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote:

 The key here is to prioritize your data.  Impossible-to-replicate data gets
 backed up using whatever means necessary; hard-to-regenerate data is the next
 priority. Easy-to-regenerate and OK-to-nuke data doesn't get backed up.


I think that's good advice to start with when creating a backup strategy.
E.g. what we do at the moment is analyze huge volumes of access
logs: we import those logs into HDFS, create aggregates for
several metrics and finally store the results in sequence files using
block-level compression. It's kind of an intermediate format that can
be used for further analysis. Those files end up being pretty small
and are exported daily to storage and backed up. In case
HDFS goes to hell we can restore some raw log data from the servers
and only lose historical logs, which should not be a big deal.

I must also add that I really enjoy the great deal of optimization
opportunity that Hadoop gives you by letting you implement the
serialization strategies directly. You really get control over every bit and
byte that gets recorded. Same with compression. So you can make the
best trade-offs possible and store only the data you really need.
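
For reference, a minimal sketch of writing such a block-compressed sequence
file (the key/value types and path are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class AggregateWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/aggregates/2009-02-12.seq");
    // BLOCK compression compresses batches of records together,
    // which keeps the resulting files small
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, LongWritable.class, CompressionType.BLOCK);
    writer.append(new Text("pageviews/2009-02-12"), new LongWritable(123456));
    writer.close();
  }
}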


Re: How to use DBInputFormat?

2009-02-06 Thread Stefan Podkowinski
On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg fred...@avafan.com wrote:
 Well, that obviously depends on the RDBMS' implementation. And although the
 case is not as bad as you describe (otherwise you'd better ask your RDBMS
 vendor for your money back), your point is valid. But then again, an RDBMS is
 not designed for that kind of work.

Right. Clash of design paradigms. Hey MySQL, I want my money back!! Oh, wait..
Another scenario I just recognized: what about current/real-time
data? E.g. 'select * from logs where date = today()'. Working with
'offset' may turn out to return different results after the table has
been updated while tasks are still pending. Pretty ugly to track down
this condition after you find out that sometimes your results are
just not right.


 What do you mean by creating splits/map tasks on the fly dynamically?


 Fredrik


 On Feb 5, 2009, at 4:49 PM, Stefan Podkowinski wrote:

 As far as I understand, the main problem is that you need to create
 splits from streaming data with an unknown number of records and
 offsets. It's just the same problem as with externally compressed data
 (.gz). You need to go through the complete stream (or do a table scan)
 to create logical splits. Afterwards each map task needs to seek to
 the appropriate offset on a new stream all over again. Very expensive. As
 with compressed files, no wonder only one map task is started for each
 .gz file and it consumes the complete file. IMHO the DBInputFormat
 should follow this behavior and just create one split whatsoever.
 Maybe a future version of Hadoop will allow creating splits/map tasks
 on the fly, dynamically?

 Stefan

 On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg fred...@avafan.com
 wrote:

 Indeed sir.

 The implementation was designed like you describe for two reasons. First and
 foremost to make it as simple as possible for the user to use a JDBC
 database as input and output for Hadoop. Secondly because of the specific
 requirements the MapReduce framework brings to the table (split
 distribution, split reproducibility etc.).

 This design will, as you note, never handle the same amount of data as HBase
 (or HDFS), and was never intended to. That being said, there are a couple of
 ways that the current design could be augmented to perform better (and, as
 in its current form, tweaked, depending on your data and computational
 requirements). Shard awareness is one way, which would let each
 database/tasktracker node execute mappers on data where each split is a
 single database server, for example.

 If you have any ideas on how the current design can be improved, please do share.


 Fredrik

 On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

 The 0.19 DBInputFormat class implementation is IMHO only suitable for
 very simple queries working on only a few datasets. That's due to the
 fact that it tries to create splits from the query by
 1) getting a count of all rows using the specified count query (huge
 performance impact on large tables)
 2) creating splits by issuing an individual query for each split, with
 limit and offset parameters appended to the input SQL query

 Effectively your input query select * from orders would become
 select * from orders limit splitlength offset splitstart, executed
 until the count has been reached. I guess this is not valid SQL
 syntax for Oracle.

 Stefan


 2009/2/4 Amandeep Khurana ama...@gmail.com:

 Adding a semicolon gives me the error ORA-00911: Invalid character

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com
 wrote:

 Amandeep,
 SQL command not properly ended
 I get this error whenever I forget the semicolon at the end.
 I know, it doesn't make sense, but I recommend giving it a try

 Rasit

 2009/2/4 Amandeep Khurana ama...@gmail.com:

 The same query is working if I write a simple JDBC client and query
 the
 database. So, I'm probably doing something wrong in the connection

 settings.

 But the error looks to be on the query side more than the connection

 side.

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com

 wrote:

 Thanks Kevin

 I couldn't get it to work. Here's the error I get:

 bin/hadoop jar ~/dbload.jar LoadTable1
 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
 09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
 java.io.IOException: ORA-00933: SQL command not properly ended

 at



 org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321

Re: How to use DBInputFormat?

2009-02-05 Thread Stefan Podkowinski
The 0.19 DBInputFormat class implementation is IMHO only suitable for
very simple queries working on only a few datasets. That's due to the
fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge
performance impact on large tables)
2) creating splits by issuing an individual query for each split, with
limit and offset parameters appended to the input sql query

Effectively your input query select * from orders would become
select * from orders limit splitlength offset splitstart, executed
until the count has been reached. I guess this is not valid SQL
syntax for Oracle.
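
For context, this is roughly how the 0.19 DBInputFormat gets wired up (old
mapred API; the table, columns and record class below are made up), and the
count and limit/offset queries described above are then generated from these
parameters:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbJobSetup {

  /** Hypothetical record class, one instance per table row. */
  public static class OrderRecord implements DBWritable, Writable {
    long id;
    double total;
    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1); total = rs.getDouble(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id); ps.setDouble(2, total);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong(); total = in.readDouble();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id); out.writeDouble(total);
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/shop", "user", "pass");
    // per split: "SELECT id, total FROM orders ORDER BY id LIMIT ... OFFSET ..."
    // plus one "SELECT COUNT(*) FROM orders" up front to size the splits
    DBInputFormat.setInput(job, OrderRecord.class,
        "orders",       // input table
        null,           // conditions
        "id",           // order-by column
        "id", "total"); // fields to select
  }
}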

Stefan


2009/2/4 Amandeep Khurana ama...@gmail.com:
 Adding a semicolon gives me the error ORA-00911: Invalid character

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

 Amandeep,
 SQL command not properly ended
 I get this error whenever I forget the semicolon at the end.
 I know, it doesn't make sense, but I recommend giving it a try

 Rasit

 2009/2/4 Amandeep Khurana ama...@gmail.com:
  The same query is working if I write a simple JDBC client and query the
  database. So, I'm probably doing something wrong in the connection
 settings.
  But the error looks to be on the query side more than the connection
 side.
 
  Amandeep
 
 
  Amandeep Khurana
  Computer Science Graduate Student
  University of California, Santa Cruz
 
 
  On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com
 wrote:
 
  Thanks Kevin
 
  I couldn't get it to work. Here's the error I get:
 
  bin/hadoop jar ~/dbload.jar LoadTable1
  09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
  processName=JobTracker, sessionId=
  09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
  09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
  09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
  09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
  java.io.IOException: ORA-00933: SQL command not properly ended
 
  at
 
 org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
  at
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
  java.io.IOException: Job failed!
  at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
  at LoadTable1.run(LoadTable1.java:130)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at LoadTable1.main(LoadTable1.java:107)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
 Source)
  at java.lang.reflect.Method.invoke(Unknown Source)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
  at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 
  Exception closing file
 
 /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0
  java.io.IOException: Filesystem closed
  at
 org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
  at
 org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
  at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
  at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
  at
  org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
  at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
  at
 
 org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
  at
  org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
  at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
  at
  org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)
 
 
  Here's my code:
 
  public class LoadTable1 extends Configured implements Tool  {
 
   // data destination on hdfs
   private static final String CONTRACT_OUTPUT_PATH = "contract_table";

   // The JDBC connection URL and driver implementation class
   private static final String CONNECT_URL = "jdbc:oracle:thin:@dbhost:1521:PSEDEV";
   private static final String DB_USER = "user";
   private static final String DB_PWD = "pass";
   private static final String DATABASE_DRIVER_CLASS = "oracle.jdbc.driver.OracleDriver";

   private static final String CONTRACT_INPUT_TABLE =
  

Re: How to use DBInputFormat?

2009-02-05 Thread Stefan Podkowinski
As far as I understand, the main problem is that you need to create
splits from streaming data with an unknown number of records and
offsets. It's just the same problem as with externally compressed data
(.gz). You need to go through the complete stream (or do a table scan)
to create logical splits. Afterwards each map task needs to seek to
the appropriate offset on a new stream all over again. Very expensive. As
with compressed files, no wonder only one map task is started for each
.gz file and it consumes the complete file. IMHO the DBInputFormat
should follow this behavior and just create one split whatsoever.
Maybe a future version of Hadoop will allow creating splits/map tasks
on the fly, dynamically?

Stefan

On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg fred...@avafan.com wrote:
 Indeed sir.

 The implementation was designed like you describe for two reasons. First and
 foremost to make it as simple as possible for the user to use a JDBC
 database as input and output for Hadoop. Secondly because of the specific
 requirements the MapReduce framework brings to the table (split
 distribution, split reproducibility etc.).

 This design will, as you note, never handle the same amount of data as HBase
 (or HDFS), and was never intended to. That being said, there are a couple of
 ways that the current design could be augmented to perform better (and, as
 in its current form, tweaked, depending on your data and computational
 requirements). Shard awareness is one way, which would let each
 database/tasktracker node execute mappers on data where each split is a
 single database server, for example.

 If you have any ideas on how the current design can be improved, please do share.


 Fredrik

 On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

 The 0.19 DBInputFormat class implementation is IMHO only suitable for
 very simple queries working on only a few datasets. That's due to the
 fact that it tries to create splits from the query by
 1) getting a count of all rows using the specified count query (huge
 performance impact on large tables)
 2) creating splits by issuing an individual query for each split, with
 limit and offset parameters appended to the input sql query

 Effectively your input query select * from orders would become
 select * from orders limit splitlength offset splitstart, executed
 until the count has been reached. I guess this is not valid SQL
 syntax for Oracle.

 Stefan


 2009/2/4 Amandeep Khurana ama...@gmail.com:

 Adding a semicolon gives me the error ORA-00911: Invalid character

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

 Amandeep,
 SQL command not properly ended
 I get this error whenever I forget the semicolon at the end.
 I know, it doesn't make sense, but I recommend giving it a try

 Rasit

 2009/2/4 Amandeep Khurana ama...@gmail.com:

 The same query is working if I write a simple JDBC client and query the
 database. So, I'm probably doing something wrong in the connection

 settings.

 But the error looks to be on the query side more than the connection

 side.

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com

 wrote:

 Thanks Kevin

 I couldn't get it to work. Here's the error I get:

 bin/hadoop jar ~/dbload.jar LoadTable1
 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
 09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
 java.io.IOException: ORA-00933: SQL command not properly ended

   at


 org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)

   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
   at

 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 java.io.IOException: Job failed!
   at

 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)

   at LoadTable1.run(LoadTable1.java:130)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
   at LoadTable1.main(LoadTable1.java:107)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown

 Source)

   at java.lang.reflect.Method.invoke(Unknown Source)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
   at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65