There has been significant work on building a WebDAV interface for HDFS. I
haven't heard any news for some time, however.
On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote:
>
>> The Eclipse plug-in also features a DFS browser.
>
> Yep. That's all true, I don't mean to self-promote
The web interface can also be used. This is handy if you are following the
progress of the job via the web.
Scroll to the bottom of the page.
On 1/20/08 11:39 PM, "Jeff Hammerbacher" <[EMAIL PROTECTED]>
wrote:
> ./bin/hadoop job -Dmapred.job.tracker=:
> -kill
>
> you can find the required c
I would say that it is generally better practice to put hadoop.jar in the
lib directory of the war file you are deploying, so that you can change
versions of hadoop more easily.
Your problem is that you have dropped the tomcat support classes from your
CLASSPATH in the process of getting h
We effectively have this situation on a significant fraction of our
workload as well. Much of our data is summarized hourly and is encrypted
and compressed, which makes it unsplittable. This means that the map
processes are often not local to the data since the data is typically spread
only to
> Yep, I can see all 34 blocks and view chunks of actual data from each
> using the web interface (quite a nifty tool). Any other suggestions?
>
> --Matt
>
> -----Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 18, 2008 11:2
Go into the web interface and look at the file.
See if you can see all of the blocks.
On 1/18/08 7:46 AM, "Matt Herndon" <[EMAIL PROTECTED]> wrote:
> Hello,
>
>
>
> I'm trying to get Hadoop to process a 2 gig file but it seems to only be
> processing the first block. I'm running the exact
would see this:
> /user/bear/output/part-0
>
> I probably got confused on what the part-# means... I thought
> part-# tells how many splits a file has... so far, I have only
> seen part-0. When will it have part-1, 2, etc?
>
>
Parallelizing the processing of data occurs at two steps. The first is
during the map phase where the input data file is (hopefully) split across
multiple tasks. This should happen transparently most of the time unless
you have a perverse data format or use unsplittable compression on your
file
This isn't really a question about Hadoop, but is about system
administration basics.
You are probably missing a master boot record (MBR) on the disk. Ask a
local linux expert to help you or look at the Norton documentation.
On 1/16/08 4:59 AM, "Bin YANG" <[EMAIL PROTECTED]> wrote:
> I use th
ould of course help in this case, but what about
> when we process large datasets? Especially if a mapper fails.
>
> Reducers I also setup to use ~1 per core, slightly less.
>
> /Johan
>
> Ted Dunning wrote:
>> Why so many mappers and reducers relative to the number o
Output a constant key in the map function.
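A minimal sketch of that idea, assuming the old (pre-generics) mapred API; the class and key names here are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Every record is emitted under the same constant key, so a single reduce
// group sees all the values and can keep a global top-10 (or whatever summary
// you need across the whole data set).
public class ConstantKeyMapper extends MapReduceBase implements Mapper {
    private static final Text ALL = new Text("all");

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
        output.collect(ALL, value);
    }
}

With only one key in play you would normally also run a single reduce task, or accept that the other reducers get no data.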
On 1/15/08 9:31 PM, "Vadim Zaliva" <[EMAIL PROTECTED]> wrote:
> On Jan 15, 2008, at 17:56, Peter W. wrote:
>
> That would output last 10 values for each key. I need
> to do this across all the keys in the set.
>
> Vadim
>
>> Hello,
>>
>> Try using
> To: hadoop-user@lucene.apache.org
> Sent: Tuesday, January 15, 2008 4:13:11 PM
> Subject: Re: single output file
>
>
>
> On Jan 15, 2008, at 13:57, Ted Dunning wrote:
>
>> This is happening because you have many reducers running, only one
>> of which
>> gets any data
This is happening because you have many reducers running, only one of which
gets any data.
Since you have combiners, this probably isn't a problem. That reducer
should only get as many records as you have maps. It would be a problem if
your reducer were getting lots of input records.
You can
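One way to get a single output file (a sketch; not necessarily what the original reply went on to suggest) is to run exactly one reduce task, so all map output funnels through one reducer into a single part file:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SingleOutputJob {
    public static void main(String[] args) throws Exception {
        // SingleOutputJob is a placeholder driver class name.
        JobConf conf = new JobConf(SingleOutputJob.class);
        conf.setJobName("single-output");
        // ... set mapper, combiner, input/output formats and paths here ...
        conf.setNumReduceTasks(1);   // one reducer => one output part file
        JobClient.runJob(conf);
    }
}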
ed.
>
> Miles
>
> On 15/01/2008, John Heidemann <[EMAIL PROTECTED]> wrote:
>>
>> On Tue, 15 Jan 2008 09:09:07 PST, Ted Dunning wrote:
>>>
>>> Regarding the race condition, hadoop builds task specific temporary
>>> directories in the output di
Why so many mappers and reducers relative to the number of machines you
have? This just causes excess heartache when running the job.
My standard practice is to run with a small factor larger than the number of
cores that I have (for instance 3 tasks on a 2 core machine). In fact, I
find it mos
Regarding the race condition, hadoop builds task specific temporary
directories in the output directory, one per reduce task, that hold these
output files (as long as you don't use absolute path names). When the
process completes successfully, the output files from that temporary
directory are mo
That's a fine way.
If you already have a Linux master distribution, then rsync can distribute
the hadoop software very quickly.
On 1/15/08 6:26 AM, "Bin YANG" <[EMAIL PROTECTED]> wrote:
> Dear colleagues,
>
> Right now, I have to deploy ubuntu 7.10 + hadoop 0.15 on 16 PCs.
> One PC will be se
Just run that same command on a different machine.
On 1/14/08 4:33 AM, "[EMAIL PROTECTED]"
<[EMAIL PROTECTED]> wrote:
>
> I have a 4 node cluster setup up & running. Every time I have to copy
> data to HDFS, I copy it to name node and using "hadoop dfs copyfromlocal
> ..." I copy it to HDFS.
Presumably the limit could be made dynamic. The limit could be
max(static_limit, number of cores in cluster / # active jobs)
On 1/10/08 9:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:
> this may be simple - but is this the right solution? (and i have the same
> concern about hod)
>
>
Actually, all of my jobs tend to have one of these phases dominate the time.
It isn't always the same phase that dominates, though, so the consideration
isn't simple.
The fact (if it is a fact) that one phase or another dominates means,
however, that splitting them won't help much.
On 1/10/08
sException within Hadoop; I believe because of the input
>> dataset size (around 90 million lines).
>>
>> I think it is important to make a distinction between setting total
>> number of map/reduce tasks and the number that can run(per job) at any
>> given time.
You may need to upgrade, but 15.1 does just fine with multiple jobs in the
cluster. Use conf.setNumMapTasks(int) and conf.setNumReduceTasks(int).
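For example (a sketch with made-up numbers), keeping each job's task counts modest so that several jobs can share the cluster at once:

import org.apache.hadoop.mapred.JobConf;

public class SharedClusterConfig {
    // Hypothetical helper: cap a job's map and reduce task counts so it does
    // not monopolize the cluster.  The numbers are placeholders, and note that
    // setNumMapTasks is only a hint to the framework.
    public static void capTasks(JobConf conf) {
        conf.setNumMapTasks(20);
        conf.setNumReduceTasks(4);
    }
}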
On 1/9/08 11:25 AM, "Xavier Stevens" <[EMAIL PROTECTED]> wrote:
> Does Hadoop support running simultaneous jobs? If so, what parameters
> do I need
amount of dfs used space,
> reserved space, and non-dfs used space when the out of disk problem
> occurs.
>
> Hairong
>
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 08, 2008 1:37 PM
> To: hadoop-user@lucene.apache.org
sks take a lot of disk space.
>
> Hairong
>
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 08, 2008 1:13 PM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Limit the space used by hadoop on a slave node
>
>
> I thin
> wrote:
> We use,
>
> dfs.datanode.du.pct for 0.14 and dfs.datanode.du.reserved for 0.15.
>
> Change was made in the Jira Hairong mentioned.
> https://issues.apache.org/jira/browse/HADOOP-1463
>
> Koji
>
>> -Original Message-
>> From: Ted Dunning [mailto:[EMAIL
I think I have seen related bad behavior on 15.1.
On 1/8/08 11:49 AM, "Hairong Kuang" <[EMAIL PROTECTED]> wrote:
> Has anybody tried 15.0? Please check
> https://issues.apache.org/jira/browse/HADOOP-1463.
>
> Hairong
> -Original Message-
> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTE
Can you put this on the wiki or as a comment on the jira? This could be (as
you just noticed) a life-saver.
On 1/8/08 10:48 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:
> never mind. the storageID is logged in the namenode logs. i am able to restore
> the version files and add the datano
Dhruba,
It looks from the discussion like the file was overwritten in place.
Is that good practice? Normally the way that this sort of update is handled
is to write a temp file, move the live file to a backup, then move the temp
file to the live place. Both moves are atomic so the worst case i
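A sketch of that idiom in plain Java (file names and helper are hypothetical); each rename is atomic on a local POSIX filesystem, so a crash leaves either the old file, the backup, or the new file, never a torn write:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class SafeUpdate {
    // Write the new contents to a temp file, move the live file aside,
    // then move the temp file into place.
    public static void update(File live, String newContents) throws IOException {
        File temp = new File(live.getPath() + ".tmp");
        File backup = new File(live.getPath() + ".bak");

        FileWriter out = new FileWriter(temp);
        out.write(newContents);
        out.close();

        if (backup.exists()) {
            backup.delete();   // clear any stale backup from an earlier run
        }
        if (live.exists() && !live.renameTo(backup)) {
            throw new IOException("could not move " + live + " aside");
        }
        if (!temp.renameTo(live)) {
            throw new IOException("could not move " + temp + " into place");
        }
    }
}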
This has bitten me as well. It used to be that I would have two possible
partitions depending on which kind of machine I was on. Some machines had
both partitions available, but one was much smaller. Hadoop had a nasty
tendency to fill up the smaller partition. Reordering the partitions in the
everything into one big fat job jar
>
> Am I missing something?
>
> Question, is the JIRA 1622 actually usable yet? I am using a about 14
> day old nightly developers build, so that should have that in that case?
>
> Which way would you go?
>
> Lars
>
>
> Ted
as following this:
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02860.html
>
> Which I could not find on the Wiki really, although the above is a
> commit. Am I missing something?
>
> Lars
>
>
> Ted Dunning wrote:
>> /lib is definitely the way to go.
>>
/lib is definitely the way to go.
But adding gobs and gobs of stuff there makes jobs start slowly because you
have to propagate a multi-megabyte blob to lots of worker nodes.
I would consider adding universally used jars to the hadoop class path on
every node, but I would also expect to face con
The fsck output shows at least one file that doesn't have a replica.
I have seen situations where a block would not replicate. It turned out to
be due to a downed node that had not yet been marked as down. Once the
system finally realized the node was down, the fsck changed from reporting
low r
is sorted out? I
>> am willing to pay consulting fees if I have to. At the moment I am at a
>> loss - sure I trial and error approach would keep me going forward, but
>> I am on a tight deadline too and that counters that approach.
>>
>> Any help is appreciated.
>>
Lars,
Can you dump your documents to external storage (either HDFS or ordinary
file space storage)?
On 1/4/08 10:01 PM, "larsgeorge" <[EMAIL PROTECTED]> wrote:
>
> Jim,
>
> I have inserted about 5million documents into HBase and translate them into
> 15 languages (means I end up with about 7
It can take a long time to decide that a node is down. If that down node
has the last copy of a file, then it won't get replicated.
I run a balancing script every few hours. It wanders through the files and
ups the replication of each file temporarily. This is important because
initial allocat
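The core of such a script, as a hedged sketch (the path and replication numbers are made up), is just FileSystem.setReplication:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rebalance {
    // Temporarily raise the replication of a file, then drop it back down.
    // Raising it forces new replicas onto other datanodes; lowering it again
    // lets the namenode delete the surplus copies, which tends to even out
    // usage across nodes.
    public static void rebalance(Path file) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.setReplication(file, (short) 5);   // placeholder "high" value
        // ... wait for the extra replicas to be created ...
        fs.setReplication(file, (short) 3);   // back to the normal replication
    }
}

A real script would walk the directory tree and apply this to each file in turn.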
name in /conf/masters and
> /conf/slaves files. It is working fine.
>
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 03, 2008 1:00 AM
> To: hadoop-user@lucene.apache.org;
> public-hadoop-user-PPu3vs9EauNd/SJB6HiN2Ni2O/[EMAIL
export HADOOP_SLAVE_SLEEP=0.1
>
> # The directory where pid files are stored. /tmp by default.
> # export HADOOP_PID_DIR=/var/hadoop/pids
>
> # A string representing this instance of hadoop. $USER by default.
> # export HADOOP_IDENT_STRING=$USER
>
> # The scheduling priori
Well, you have something very strange going on in your scripts. Have you
looked at hadoop-env.sh?
On 1/2/08 1:58 PM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:
>> /bin/bash: /root/.bashrc: Permission denied
>> localhost: ssh: localhost: Name or service not known
>> /bin/bash: /root/.bashr
I don't know what your problem is, but I note that you appear to be running
processes as root.
This is a REALLY bad idea. It may also be related to your problem.
On 1/2/08 1:33 PM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:
> Hi,
> I am new to Hadoop. I just downloaded release 0.14.4 (ha
That is a good idea. I currently use a shell script that does the rough
equivalent of rsync -av, but it wouldn't be bad to have a one-liner that
solves the same problem.
One (slight) benefit to the scripted approach is that I get a list of
directories to which files have been moved. That lets m
I would like to point out that this is a REALLY bad idiom. You should use a
static initializer.
private static Map usersMap = new HashMap();
Also, since this is a static field in a very small class, there is very
little reason to use a getter. No need for 7 lines of code when one will
do.
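In other words, something like this (a sketch with hypothetical class and field names):

import java.util.HashMap;
import java.util.Map;

public class UserCounter {
    // Created once when the class is loaded; no lazy-init getter required.
    // Remember that each map task runs in its own JVM, so this map is only
    // shared within one task, not across the whole job.
    private static final Map usersMap = new HashMap();
}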
> if(getUsersMap().get(nKey)==null){
> output.collect(name, ONE);
> getUsersMap().put(nKey, data[12]);
> }
>
> ..
>
>
> }
>
>
> the problem is my hashmap (usersMap) is always empty. Now I hope
> my problem is clear.
>
> Thanks,
>
> Helen
>
>
>
I figured.
On 12/29/07 7:03 AM, "Milan Simonovic" <[EMAIL PROTECTED]> wrote:
>
> That's what I wanted to say :) my mistake
>
> Saturday, December 29, 2007, 3:55:31 PM, you wrote:
>
>> Actually, this isn't true (you must know this). Each element is multiplied
>> by every element of the corre
Actually, this isn't true (you must know this). Each element is multiplied
by every element of the corresponding row or column of the other matrix.
This is (thankfully) much less communication.
On 12/29/07 6:48 AM, "Milan Simonovic" <[EMAIL PROTECTED]> wrote:
> Ordinary matrix multiplication m
The most surprising thing about hadoop is the degree to which you are
exactly correct.
My feeling is that what is really happening is that the pain is moving (and
moderating) to the process of adopting map-reduce as a programming paradigm.
Once you do that, the pain is largely over.
On 12/29/07
For dense matrix multiplication, the key problem is that you have O(n^3)
arithmetic operations and O(n^2) element fetches. Most conventional
machines now have nearly 10^2 or larger ratio between the speed of the
arithmetic processor and memory, so for n > 100, you should be able to
saturate the ar
This sounds like there is a little bit of confusion going on here.
It is common for people who are starting with Hadoop to be surprised
when static fields of the mapper do not get shared across all parallel
instances of the map function. This is, of course, because you are running
many m
That is a very small heap. The reduces, in particular, would benefit
substantially from having more memory.
Other than that (and having fewer reduces), I am at a bit of a loss. I know
that others are working on comparably sized problems without much
difficulty.
There might be an interaction w
Can you say a bit more about your processes? Are they truly parallel maps
without any shared state?
Are you getting a good limit on maximum number of maps and reduces per
machine?
How are you measuring these times? Do they include shuffle time as well as
map time? Do they include time before
Sounds much better to me.
On 12/26/07 7:53 AM, "Eric Baldeschwieler" <[EMAIL PROTECTED]> wrote:
>
> With a secondary sort on the values during the shuffle, nothing would
> need to be kept in memory, since it could all be counted in a single
> scan. Right? Wouldn't that be a much more efficien
My namenode and jobtracker are both on a machine that is a datanode and has
a tasktracker as well. It is also less well outfitted than yours.
I have no problems, but my data is encrypted which might make the CPU/disk
trade-offs very different.
On 12/26/07 12:11 PM, "Jason Venner" <[EMAIL PROTE
That would be a fine way to solve the problem.
You can also pass data in to the maps via the key since the key has little
use for most maps.
On 12/25/07 8:09 PM, "Norbert Burger" <[EMAIL PROTECTED]> wrote:
> How should I approach this? Is overriding InputFileFormat so that the
> header data i
orly with the standard input
> split size, as the mean time to finish a split is very small, vs.
> gigantic memory requirements for large split sizes.
>
> Time to play with parameters again ... since the answer doesn't appear
> to be in working memory for the list.
>
>
What are your mappers doing that they run out of memory? Or is it your
reducers?
Often, you can write this sort of program so that you don't have higher
memory requirements for larger splits.
On 12/25/07 1:52 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
> We have tried reducing the number of
Ahhh. My previous comments assumed that "long-lived" meant jobs that run
for days and days and days (essentially forever).
15 minute jobs with a finite work-list is actually a pretty good match for
map-reduce as implemented by Hadoop.
On 12/25/07 10:04 AM, "Kirk True" <[EMAIL PROTECTED]> wro
1 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
> On Fri, Dec 21, 2007 at 12:43:38PM -0800, Ted Dunning wrote:
>>
>> * if you need some kind of work-flow, hadoop won't help (but it won't hurt
>> either)
>>
>
> Lets start a discussion around this, seems to be something lots of folks could
> use...
Sorry. I meant to answer that.
The short answer is that hadoop is often reasonable for this sort of
problem, BUT
* if you have lots of little files, you may do better otherwise
* if you can't handle batch-oriented, merge-based designs, then map-reduce
itself isn't going to help you much
* if
Yeah... We have that as well, but I put strict limits on how many readers
are allowed on any NFS data source. With well organized reads, even a
single machine can cause serious load on an ordinary NFS server. I have had
very bad experiences where lots of maps read from a single source; the worst
doop distcp" using multiple trackers to upload files in
> parallel.
>
> Thanks,
>
> Rui
>
> - Original Message
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: hadoop-user@lucene.apache.org
> Sent: Thursday, December 20, 2007 6:01:50 PM
> Subje
Map-reduce is just one way of organizing your computation. If you have
something simpler, then I would say that you are doing fine.
There are plenty of tasks that are best served by a DAG of simple tasks.
Systems like Amazon's simple queue (where tasks come back to life if they
aren't "finished"
On 12/20/07 5:52 PM, "C G" <[EMAIL PROTECTED]> wrote:
> Ted, when you say "copy in the distro" do you need to include the
> configuration files from the running grid? You don't need to actually start
> HDFS on this node do you?
You are correct. You only need the config files (and the hadoo
Just copy the hadoop distro directory to the other machine and use whatever
command you were using before.
A program that uses hadoop just has to have access to all of the nodes
across the net. It doesn't assume anything else.
On 12/20/07 2:35 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:
you rebuild the name
>>>> node. There is no official solution for the high availability problem.
>>>> Most hadoop systems work on batch problems where an hour or two of
>>>> downtime
>>>> every few years is not a problem.
>>
>> Actu
Yes.
I try to always upload data from a machine that is not part of the cluster
for exactly that reason.
I still find that I need to rebalance due to a strange problem in placement.
My datanodes have HDFS disks that differ in size by 10x, and I suspect that the
upload is picking datanodes uniformly rath
Well, we are kind of a poster child for this kind of reliability calculus.
We opted for Mogile for real-time serving because we could see how to split
the master into shards and how to do HA on it. For batch oriented processes
where a good processing model is important, we use hadoop.
I would ha
What happened here is that you formatted the name node but have data left
over from the previous incarnation of the namenode. The namenode can't deal
with that situation.
On 12/19/07 11:25 PM, "M.Shiva" <[EMAIL PROTECTED]> wrote:
>
> /**
On 12/19/07 11:17 PM, "M.Shiva" <[EMAIL PROTECTED]> wrote:
> 1. Are separate machines/nodes needed for the Namenode, Jobtracker, and slave nodes?
No. I run my namenode and job-tracker on one of my storage/worker nodes.
You can run everything on a single node and still get some interesting
results becau
You should also be able to get quite a bit of mileage out of special-purpose
HashMaps. In general, java generic collections incur large to huge
penalties for certain special cases. If you have one of these special cases
or can put up with one, then you may be able to get 1+ order of magnitude
impr
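As a rough illustration of where those penalties come from (a hypothetical, self-contained example, not taken from the original thread): counting occurrences of small integer codes with a generic map boxes every key and value, while a special-purpose structure avoids objects entirely.

import java.util.HashMap;
import java.util.Map;

public class CountingDemo {
    public static void main(String[] args) {
        int[] codes = {7, 7, 42, 7, 42, 3};   // made-up data

        // Generic collection: every get/put boxes an Integer key and value.
        Map counts = new HashMap();
        for (int i = 0; i < codes.length; i++) {
            Integer code = new Integer(codes[i]);
            Integer old = (Integer) counts.get(code);
            counts.put(code, new Integer(old == null ? 1 : old.intValue() + 1));
        }

        // Special-purpose structure for a known small key range: no objects at all.
        int[] fast = new int[64];
        for (int i = 0; i < codes.length; i++) {
            fast[codes[i]]++;
        }

        System.out.println(counts + " vs fast[7]=" + fast[7]);
    }
}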
However, it depended
> upon the file output formats I used in the first step.Because I
> got so confused, I thought it would be more important to nail down the
> correct output format in the first step.
>
> -- Jim
>
> On Dec 17, 2007 10:24 PM, Ted Dunning <[EMAIL PROT
the second step, or
> were you asking me why I never set it in the second step?
>
>
> On Dec 17, 2007 10:09 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> You never set the input format in the second step.
>>
>> But I think you want to stay
alues are
> clear Text, and they can subsequently be read by
> KeyValueTextInputFormat.
>
> On Dec 17, 2007 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>
>> I thought that is what your input file already was. The
>> KeyValueTextInputFormat should
You never set the input format in the second step.
But I think you want to stay with your KeyValueTextInputFormat for input and
TextOutputFormat for output.
On 12/17/07 7:03 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>
> So that's a part of the reason that I am having trouble conn
I thought that is what your input file already was. The
KeyValueTextInputFormat should read your input as-is.
When you write out your intermediate values, just make sure that you use
TextOutputFormat and put "DIR" as the key and the directory name as the
value (same with files).
On 12/17/07 6
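A hedged sketch of wiring that up (the driver class is hypothetical, and input/output paths, mapper and reducer are assumed to be set elsewhere):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class PassTwoConfig {
    // Read tab-separated key/value text and write it back out the same way,
    // so the next job can read it again with KeyValueTextInputFormat.
    public static void configure(JobConf conf) {
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
    }
}

In the map you would then emit something like output.collect(new Text("DIR"), new Text(directoryName)), and similarly for files.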
Part of your problem is that you appear to be using a TextInputFormat (the
default input format). The TIF produces keys that are LongWritable and
values that are Text.
Other input formats produce different types.
With recent versions of hadoop, classes that extend InputFormatBase can (and
I th
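Concretely, with TextInputFormat the map sees a byte offset and a line of text. A sketch in the old (pre-generics) mapred API, with a hypothetical class name:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineMapper extends MapReduceBase implements Mapper {
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
        LongWritable offset = (LongWritable) key;   // byte offset into the file
        Text line = (Text) value;                   // the line itself
        // If your code expects a Text key here, a cast like (Text) key will fail.
        output.collect(line, offset);
    }
}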
Hadoop is new technology. You aren't going to find opportunities to work
with it via job agencies.
That said, there is a growing trend towards scalable systems in general and
Hadoop in particular. Lately, it seems that everywhere I turn around, I
find another startup company using hadoop. I ju
Devaraj is correct that there is no mechanism to create reduce tasks only as
necessary, but remember that each reducer does many reductions. This means
that empty ranges rarely have a large, unbalanced effect.
If this is still a problem, you can do two things:
- first, you can use the hash of th
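That first option (spreading keys over the reduce tasks by a hash of the key, which is an assumption about where the cut-off sentence was going) might look like this in the old mapred API; the modulo hashing shown is essentially what the default HashPartitioner does, but this is where you would plug in your own spreading function:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashOfKeyPartitioner implements Partitioner {
    public void configure(JobConf job) {
    }

    // Spread keys over the reduce tasks by hash so that no single reducer is
    // stuck with an empty (or overfull) key range.
    public int getPartition(WritableComparable key, Writable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

You would register it with conf.setPartitionerClass(HashOfKeyPartitioner.class).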
Yes.
On 12/13/07 12:22 PM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote:
> On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote:
>> After indexing, indexes are moved to multiple query servers. ... (how nutch
>> works) With this architecture, you g
:
> On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote:
>>
>> I don't think so (but I don't run nutch)
>>
>> To actually run searches, the search engines copy the index to local
>> storage. Having them in HDFS is very nice, however, as a way to
I don't think so (but I don't run nutch)
To actually run searches, the search engines copy the index to local
storage. Having them in HDFS is very nice, however, as a way to move them
to the right place.
On 12/13/07 10:59 AM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote:
> On Thu, Dec 13,
It seems reasonable that (de)-serialization could be done in threaded
fashion and then just block on the (read) write itself.
That would explain the utilization, which I suspect is close to 1/N, where N
is the number of processors.
On 12/12/07 2:07 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
I guess it would be even more of a surprise, then.
:-)
On 12/12/07 1:36 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>> Using gcj successfully would be a bit of a surprise.
>
> GCJ 4.2 does NOT work.
3. There is currently no security. Weak user level security will appear
soon (but you will still be able to lie about who you are). Stronger
security is in the works, but you should expect to protect a Hadoop cluster
from the outside.
2. High availability is inherent in hadoop's map-reduce s
Hadoop *normally* uses the Sun JDK. Using gcj successfully would be a bit
of a surprise.
On 12/11/07 11:54 PM, "aonewa" <[EMAIL PROTECTED]> wrote:
>
> Hadoop uses gcj, but St.Ack said to try Sun's JDK. Does that mean I have to
> modify the code in Hadoop, yes or no?
>
>
> stack-3 wrote:
>>
>> Try SUN's J
Absolutely. Or on a machine scaling page.
On 12/11/07 12:43 PM, "Chris Fellows" <[EMAIL PROTECTED]> wrote:
> Does this belong in the FAQ?
More to the specific point, yes, all 100 nodes will wind up storing data for
large files because blocks should be assigned pretty much at random.
The exception is files that originate on a datanode. There, the local node
gets one copy of each block. Replica blocks follow the random rule,
howeve
The web interface to the namenode will let you drill down to the file
itself. That will tell you where the blocks are (scroll down to the
bottom). You can also use hadoop fsck
For example:
[EMAIL PROTECTED]:~/hadoop-0.15.1$ bin/hadoop fsck
/user/rmobin/data/11/30Statu
Can you post a Jira and a patch?
On 12/10/07 1:12 AM, "Alan Ho" <[EMAIL PROTECTED]> wrote:
> I've written an XML input splitter based on a StAX parser. It's much better than
> StreamXMLRecordReader
>
> - Original Message
> From: Peter Thygesen <[EMAIL PROTECTED]>
> To: hadoop-user@lucen
There is a bug in the GZIPInputStream on Java 1.5 that can cause an
out-of-memory error on a malformed gzip input.
It is possible that you are trying to treat this input as a splittable file
which is causing your maps to be fed from chunks of the gzip file. Those
chunks would be ill-formed, of c
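If the framework is indeed splitting the gzip file, one workaround is an input format that refuses to split. This is only a sketch against the old mapred API (exact method signatures shifted a little between releases):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split the input, so each map task reads one whole (gzipped) file and
// the decompressor always starts at the real beginning of the stream.
public class WholeFileTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}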
up DNS and hadoop won't run"?).
Item (B) is probably a bad thing for hadoop given the bandwidth required for
the shuffle phase.
Item (C) is inherent in map-reduce and is pretty neutral either way.
On 12/5/07 9:23 AM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:
>
>
Sorry about not addressing this. (and I appreciate your gentle prod)
The Xgrid would likely work well on these problems. They are, after all,
nearly trivial to parallelize because of clean communication patterns.
Consider an alternative problem of solving n-body gravitational dynamics for
n >
If you are looking at large numbers of independent images, then hadoop should
be close to perfect for this analysis (the problem is embarrassingly
parallel). If you are looking at video, then you can still do quite well by
building what is essentially a probabilistic list of recognized items in th
It is conceivable that memcache would eventually have only or mostly active
objects in memory, while hbase might have active pages/tablets/groups of
objects. That might give memcache a bit of an edge.
Another thing that happens with memcache is that memcache can hold the
results of a complex jo
There is the largely undocumented record stream stuff. You define your
records in an IDL-like language which compiles to java code. I haven't used
it, but it doesn't look particularly hard.
I believe that this stuff includes definitions of comparators.
Also, if you just put concatenated keys i
/30)
>
> Hi,
>
> It is getting closer to Friday and I wanted to remind everyone that we
> will be meeting at Gordon Biersch in Palo Alto at 5pm this Fri (11/30):
> http://upcoming.yahoo.com/event/324051/
>
> No formal agenda, but we might have the opportunity to checkou
'dfs[a-z.]+'
>
> I got:
>
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> Could not create the Java virtual machine.
>
> Thanks,
>
> Rui
>
>
> - Original Message
> From: Ted Dunning <[EMAIL PRO
s.
>
> Thanks,
>
> Rui
>
> - Original Message
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: hadoop-user@lucene.apache.org
> Sent: Sunday, December 2, 2007 6:43:36 AM
> Subject: Re: Running Hadoop on FreeBSD
>
>
>
> You should be able to run it wit
You should be able to run it without any changes or recompilation.
Hadoop is written in Java, after all.
On 11/30/07 10:38 PM, "Rui Shi" <[EMAIL PROTECTED]> wrote:
> Did anyone port and run Hadoop on FreeBSD clusters?
Are you already using memcache and related approaches?
On 11/30/07 9:46 AM, "Mike Perkowitz" <[EMAIL PROTECTED]> wrote:
>
>
> Hello! We have a web site currently built on linux/apache/mysql/php. Most
> pages do some mysql queries and then stuff the results into php/html
> templates. We've be
> Joydeep Sen Sarma wrote:
>> Would it help if the multifileinputformat bundled files into splits based on
>> their location? (wondering if remote copy speed is a bottleneck in map)
>> If you are going to access the files many times after they are generated -
>> wri