Re: [gentoo-user] crontab - 'and' condition

2014-09-19 Thread Alan McKinnon
On 19/09/2014 06:21, Joseph wrote:
 On 09/18/14 19:14, Alan McKinnon wrote:
 On 18/09/2014 18:44, Joseph wrote:
 I want to run a cron job only once a month.  The problem is the computer
 is only powered on on weekdays, Mon-Fri (1-5).

 A crontab entry like the one below is an *or* condition, as it has entries
 in both Day of Month and Day of Week:

 5 18 1 * 2  rsync -av ...

 so it will run on day 1 *or* on Tuesdays of each month.

 Is it possible to create an *and* condition, e.g. run it on the Tuesday
 between days 1 and 7, depending on which day Tuesday falls on?


 Not in one line.

 Split it into two crontab entries.
 
 Interesting.  How do you split a cron job? I couldn't find any examples.
 


No wait, that won't work. What you want to accomplish cannot be done
with a single crontab entry.

Use periodic/monthly like the other poster said, or use anacron so the
job will run when the machine is next powered on.
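
If anacron is an option, a minimal sketch of an /etc/anacrontab entry
(field order: period, delay in minutes, job identifier, command; the
rsync arguments are placeholders):

@monthly   15   monthly-rsync   rsync -av /src/ /dest/

anacron keeps a timestamp per job, so if the machine was off when the
month ticked over, the job runs 15 minutes after the next boot.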


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] crontab - 'and' condition

2014-09-19 Thread Stephan Müller
On 18.09.2014 at 18:44, Joseph wrote:
 I want to run a cron job only once a month.  The problem is the computer is
 only powered on on weekdays, Mon-Fri (1-5).
 A crontab entry like the one below is an *or* condition, as it has entries
 in both Day of Month and Day of Week:
 
 5 18 1 * 2  rsync -av ...
 
 so it will run on day 1 *or* on Tuesdays of each month.
 Is it possible to create an *and* condition, e.g. run it on the Tuesday
 between days 1 and 7, depending on which day Tuesday falls on?

You can run it every Tuesday and check the day of month externally:

5 18 * * 2   test $(date +%d) -le 7 && rsync -av ...

or run it on

5 18 1-7 * * and test for Tuesdays, but the former gives fewer useless
invocations.
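
A sketch of that second variant, assuming GNU date (%u prints the ISO
weekday, so 2 means Tuesday; the rsync arguments are placeholders as
above):

5 18 1-7 * *   test $(date +%u) -eq 2 && rsync -av ...

This fires on each of the first seven days of the month, and the test
lets the command run only on the one of those days that is a Tuesday.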

 ~frukto 






[gentoo-user] Re: File system testing

2014-09-19 Thread James
J. Roeleveld joost at antarean.org writes:


 Out of curiosity, what do you want to simulate?

subsurface flows in porous media, AKA carbon sequestration
by injection wells. You know, provide proof that those
that remove hydrocarbons actually put the CO2 back
and significantly mitigate the effects of their ventures.

It's like this. I have been struggling with my 17 year old genius
son, who is a year away from entering medical school, with
learning responsibility. So I got him a hyperactive, highly
intelligent (mix-doberman) puppy to nurture, raise, train, love
and be responsible for. It's one genius pup teaching another
pup about being responsible.

So goes the earl_bidness...imho.



 
  Many folks are recommending skipping Hadoop/HDFS altogether

 I agree, Hadoop/HDFS is for data analysis. Like building a profile 
 about people based on the information companies like Facebook,  
 Google, NSA, Walmart, Governments, Banks, collect about their 
 customers/users/citizens/slaves/

  and go straight to mesos/spark. RDD (in-memory) cluster
  calculations are at the heart of my needs. The opposite end of the
  spectrum, loads of small files and small apps, I dunno about; but I'm all
  ears.
  In the end, my (3) node scientific cluster will morph and support
  the typical myriad of networked applications, but I can take
  a few years to figure that out, or just copy what smart guys like
  you and joost do.
  
 Nope, I'm simply following what you do and providing suggestions where I can.
 Most of the clusters and distributed computing stuff I do is based on
 adding machines to distribute the load. But the mechanisms for these are
 implemented in the applications I work with, not what I design underneath.

 The filesystems I am interested in are different to the ones you want.

Maybe. I do not know what I want yet. My vision is very lightweight
workstations running lxqt (small memory footprint) or such, and a bad_arse
cluster for the heavy lifting running on whatever heterogeneous resources I
have. From what I've read, the cluster and the file systems are all
redundant at the cluster level (mesos/spark anyway) regardless of what any
given processor/system is doing. All of Alan's fantasies (needs) can be
realized once the cluster stuff is mastered (chronos, ansible, etc.).

 I need to provide access to software installation files to a VM server 
 and access to documentation which is created by the users. The 
 VM server is physically next to what I already mentioned as server A.  
 Access to the VM from the remote site will be using remote desktop   
 connections.  But to allow faster and easier access to the 
 documentation, I need a server B at the remote site which functions as 
 described.  AFS might be suitable, but I need to be able to layer Samba 
 on top of that to allow a seamless operation.
 I don't want the laptops to have their own cache and then having to 
 figure out how to solve the multiple different changes to documents 
 containing layouts. (MS Word and OpenDocument files).

Ok so your customers (hyperactive problem users) interface with your cluster
to do their work. When finished you write things out to other servers
with all of the VM servers. Lots of really cool tools are emerging
in the cluster space.

I think these folks have mesos + spark + samba + nfs all in one box. [1]
Build rather than purchase? We have to figure out what you and Alan need, on
a cluster, because it is what most folks need/want. It's the admin_advantage
part of clusters. (There are also the Big Science (me) and Web-centric needs.
Right now they are related projects, but things will coalesce, imho.) There is
even Spark_sql for postgres admins [2].

[1]
http://www.quantaqct.com/en/01_product/02_detail.php?mid=29&sid=162&id=163&qs=102

[2] https://spark.apache.org/sql/


   We use Lustre for our high performance general storage. I don't 
   have any numbers, but I'm pretty sure it is *really* fast (10Gbit/s 
   over IB sounds familiar, but don't quote me on that).
  
  At UMich, you guys should test the FhGFS/btrfs combo. The folks
  at UCI swear by it, although they are only publishing a wee bit
  (you know, water cooler gossip). Surely the Wolverines do not
  want those Californians getting up on them?

  Are you guys planning a mesos/spark test?

Personally, I would read up on these and see how they work. Then,
based on that, decide if they are likely to assist in the specific
situation you are interested in.

  It's a ton of reading. It's not apples-to-apple_cider type of reading.
  My head hurts.

 Take a walk outside. Clear air should help you with the headaches :P

Basketball, Boobs and Bourbon used to work quite well. Now it's mostly
basketball, but I'm working on someone very cute..

  I'm leaning to DFS/LFS
  (2) Lustre/btrfs and FhGFS/btrfs

 I have insufficient knowledge to advise on either of these.
 One question, why BTRFS instead of ZFS?

I think btrfs has tremendous potential. I tried ZFS a few times,
but the installs are not part of gentoo, so they got borked;
uEFI, grub, uuids, etc. were also in the mix. That was almost
a year ago. For whatever reason the clustering folks I have
read and communicated with are using ext4, xfs and btrfs. Prolly
mostly because those are what is mostly used in their (systemd-inspired)
distros?

Re: [gentoo-user] Re: File system testing

2014-09-19 Thread Rich Freeman
On Fri, Sep 19, 2014 at 9:41 AM, James wirel...@tampabay.rr.com wrote:

 I think btrfs has tremendous potential. I tried ZFS a few times,
 but the installs are not part of gentoo, so they got borked;
 uEFI, grub, uuids, etc. were also in the mix. That was almost
 a year ago. For whatever reason the clustering folks I have
 read and communicated with are using ext4, xfs and btrfs. Prolly
 mostly because those are what is mostly used in their (systemd-inspired)
 distros?

I do think that btrfs in the long-term is more likely to be mainstream
on linux, but I wouldn't be surprised if getting zfs working on Gentoo
is much easier now.  Richard Yao is both a Gentoo dev and significant
zfs on linux contributor, so I suspect he is doing much of the latter
on the former.


 Yep, the license issue with ZFS is a real killer for me. Besides,
 as an old state-machine, C hack, anything with B-trees is fabulous.
 Prejudices? Yep, but here, I'm sticking with my gut. Multi-port
 RAM can do marvelous things with B-tree data structures. The
 rest will become available/stable. Simply, I just trust btrfs, in
 my gut.

I don't know enough about zfs to compare them, but the design of btrfs
has a certain amount of beauty/symmetry/etc to it IMHO.  I only have
studied it enough to be dangerous and give some intro talks to my LUG,
but just about everything is stored in b-trees, the design allows both
fixed and non-fixed length nodes within the trees, and just about
everything about the filesystem is dynamic other than the superblocks,
which do little more than ID the filesystem and point to the current
tree roots.  The important stuff is all replicated and versioned.

I wouldn't be surprised if it shared many of these design features
with other modern filesystems, and I do not profess to be an expert on
modern filesystem design, so I won't make any claims about btrfs being
better/worse than other filesystems in this regard.  However, I would
say that anybody interested in data structures would do well to study
it.
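
For anyone who wants to poke at those structures on a real filesystem, a
small sketch using btrfs-progs (the device path is a placeholder; newer
versions expose this via inspect-internal, older ones shipped the
equivalent btrfs-show-super and btrfs-debug-tree tools):

btrfs inspect-internal dump-super /dev/sdX1
btrfs inspect-internal dump-tree -t root /dev/sdX1

The first prints the superblock, which as noted is little more than
filesystem identity plus pointers to the current tree roots; the second
dumps the root tree those pointers lead to.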

--
Rich



Re: [gentoo-user] Re: File system testing

2014-09-19 Thread J. Roeleveld

On Friday, September 19, 2014 01:41:26 PM James wrote:
 J. Roeleveld joost at antarean.org writes:
  Out of curiosity, what do you want to simulate?
 
 subsurface flows in porous media, AKA carbon sequestration
 by injection wells. You know, provide proof that those
 that remove hydrocarbons actually put the CO2 back
 and significantly mitigate the effects of their ventures.

Interesting topic. Can't provide advice on that one, though.

 It's like this. I have been struggling with my 17 year old genius
 son, who is a year away from entering medical school, with
 learning responsibility. So I got him a hyperactive, highly
 intelligent (mix-doberman) puppy to nurture, raise, train, love
 and be responsible for. It's one genius pup teaching another
 pup about being responsible.

Overactive kids, always fun.
I try to keep mine busy without computers and TVs for now. (She's going to be 
3 in November)

 So goes the earl_bidness...imho.
 
   Many folks are recommending skipping Hadoop/HDFS altogether
  
  I agree, Hadoop/HDFS is for data analysis. Like building a profile
  about people based on the information companies like Facebook,
  Google, NSA, Walmart, Governments, Banks, collect about their
  customers/users/citizens/slaves/
  
   and go straight to mesos/spark. RDD (in-memory) cluster
   calculations are at the heart of my needs. The opposite end of the
   spectrum, loads of small files and small apps, I dunno about; but I'm
   all ears.
   In the end, my (3) node scientific cluster will morph and support
   the typical myriad of networked applications, but I can take
   a few years to figure that out, or just copy what smart guys like
   you and joost do.
  
   
  Nope, I'm simply following what you do and providing suggestions where I
  can.
  Most of the clusters and distributed computing stuff I do is based on
  adding machines to distribute the load. But the mechanisms for these are 
  implemented in the applications I work with, not what I design underneath.
  The filesystems I am interested in are different to the ones you want.
 
 Maybe. I do not know what I want yet. My vision is very lightweight
 workstations running lxqt (small memory footprint) or such, and a bad_arse
 cluster for the heavy lifting running on whatever heterogeneous resources I
 have. From what I've read, the cluster and the file systems are all
 redundant at the cluster level (mesos/spark anyway) regardless of what any
 given processor/system is doing. All of Alan's fantasies (needs) can be
 realized once the cluster stuff is mastered (chronos, ansible, etc.).

Alan = your son? Or?
I would, from the workstation point of view, keep the cluster as a single
entity, to keep things easier.
A cluster FS for workstation/desktop use is generally not suitable for a High
Performance Cluster (HPC), or vice-versa.

  I need to provide access to software installation files to a VM server
  and access to documentation which is created by the users. The
  VM server is physically next to what I already mentioned as server A.
  Access to the VM from the remote site will be using remote desktop
  connections.  But to allow faster and easier access to the
  documentation, I need a server B at the remote site which functions as
  described.  AFS might be suitable, but I need to be able to layer Samba
  on top of that to allow a seamless operation.
  I don't want the laptops to have their own cache and then having to
  figure out how to solve the multiple different changes to documents
  containing layouts. (MS Word and OpenDocument files).
 
 Ok so your customers (hyperactive problem users) interface with your cluster
 to do their work. When finished you write things out to other servers
 with all of the VM servers. Lots of really cool tools are emerging
 in the cluster space.

Actually, slightly different scenario.
Most work is done at customers' systems. Occasionally we need to test software
versions prior to implementing them at customers. For that, we use VMs.

The VM-server we have is currently sufficient for this. When it isn't, we'll
need to add a second VM-server.

On the NAS, we store:
- Documentation about customers + Howto documents on how to best install the 
software.
- Installation files downloaded from vendors (We also deal with older versions 
that are no longer available. We need to have our own collection to handle 
that)

As we are looking into also working from a different location, we need:
- Access to the VM-server (easy, using VPN and Remote Desktops)
- Access to the files (I prefer to have a local 'cache' at the remote location)

It's the access to files part where I need to have some sort of distributed 
filesystem.

 I think these folks have mesos + spark + samba + nfs all in one box. [1]
 [1]
 http://www.quantaqct.com/en/01_product/02_detail.php?mid=29&sid=162&id=163&qs=102

Had a quick look: these use MS Windows Storage 2012, which is only failover on
the storage side. I don't see anything related to 

Re: [gentoo-user] Re: File system testing

2014-09-19 Thread J. Roeleveld

On Friday, September 19, 2014 10:56:59 AM Rich Freeman wrote:
 On Fri, Sep 19, 2014 at 9:41 AM, James wirel...@tampabay.rr.com wrote:
  I think btrfs has tremendous potential. I tried ZFS a few times,
  but the installs are not part of gentoo, so they got borked;
  uEFI, grub, uuids, etc. were also in the mix. That was almost
  a year ago. For whatever reason the clustering folks I have
  read and communicated with are using ext4, xfs and btrfs. Prolly
  mostly because those are what is mostly used in their (systemd-inspired)
  distros?
 
 I do think that btrfs in the long-term is more likely to be mainstream
 on linux, but I wouldn't be surprised if getting zfs working on Gentoo
 is much easier now.  Richard Yao is both a Gentoo dev and significant
 zfs on linux contributor, so I suspect he is doing much of the latter
 on the former.

Don't have the link handy, but there is a howto about it that, when followed,
will give you a ZFS pool running on Gentoo in a very short time (emerge zfs is
the longest part of the whole thing).
Not even a reboot needed.
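
From memory, the steps amount to something like the following sketch (not
the howto itself; sys-fs/zfs is the package in the Gentoo tree, and the
pool name and devices are placeholders):

emerge --ask sys-fs/zfs        # pulls in the kernel module package too
modprobe zfs
zpool create tank mirror /dev/sdb /dev/sdc
zfs create tank/data

The pool comes up on the running kernel, which is why no reboot is needed.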

  Yep, the license issue with ZFS is a real killer for me. Besides,
  as an old state-machine, C hack, anything with B-trees is fabulous.
  Prejudices? Yep, but here, I'm sticking with my gut. Multi-port
  RAM can do marvelous things with B-tree data structures. The
  rest will become available/stable. Simply, I just trust btrfs, in
  my gut.
 
 I don't know enough about zfs to compare them, but the design of btrfs
 has a certain amount of beauty/symmetry/etc to it IMHO.  I only have
 studied it enough to be dangerous and give some intro talks to my LUG,
 but just about everything is stored in b-trees, the design allows both
 fixed and non-fixed length nodes within the trees, and just about
 everything about the filesystem is dynamic other than the superblocks,
 which do little more than ID the filesystem and point to the current
 tree roots.  The important stuff is all replicated and versioned.
 
 I wouldn't be surprised if it shared many of these design features
 with other modern filesystems, and I do not profess to be an expert on
 modern filesystem design, so I won't make any claims about btrfs being
 better/worse than other filesystems in this regard.  However, I would
 say that anybody interested in data structures would do well to study
 it.

I like the idea of both and hope BTRFS will also gain the raid-6-like 
features and good support for larger drive counts (I've got 16 drives 
available for the file storage) to make it, for me, a viable alternative to 
ZFS.

--
Joost



Re: [gentoo-user] Re: File system testing

2014-09-19 Thread Kerin Millar

On 18/09/2014 14:12, Alec Ten Harmsel wrote:


On 09/18/2014 05:17 AM, Kerin Millar wrote:

On 17/09/2014 21:20, Alec Ten Harmsel wrote:

As far as HDFS goes, I would only set that up if you will use it for
Hadoop or related tools. It's highly specific, and the performance is
not good unless you're doing a massively parallel read (what it was
designed for). I can elaborate why if anyone is actually interested.


I, for one, am very interested.

--Kerin



Alright, here goes:

Rich Freeman wrote:


FYI - one very big limitation of hdfs is that its minimum filesize is
something huge, like 1MB or something like that.  Hadoop was designed
to take a REALLY big input file and chunk it up.  If you use hdfs to
store something like /usr/portage it will turn into the sort of
monstrosity that you'd actually need a cluster to store.


This is exactly correct, except we run with a block size of 128MB, and a large 
cluster will typically have a block size of 256MB or even 512MB.

HDFS has two main components: a NameNode, which keeps track of which blocks are 
a part of which file (in memory), and the DataNodes that actually store the 
blocks. No data ever flows through the NameNode; it negotiates transfers 
between the client and DataNodes and negotiates transfers for jobs. Since the 
NameNode stores metadata in-memory, small files are bad because RAM gets wasted.

What exactly is Hadoop/HDFS used for? The most common uses are generating 
search indices on data (which is a batch job), doing non-realtime processing 
of log streams and/or data streams (another batch job), and allowing a large 
number of analysts to run disparate queries on the same large dataset (another 
batch job). Batch processing - processing the entire dataset - is really where 
Hadoop shines.

When you put a file into HDFS, it gets split based on the block size. This is 
done so that a parallel read will be really fast - each map task reads in a 
single block and processes it. Ergo, if you put in a 1GB file with a 128MB 
block size and run a MapReduce job, 8 map tasks will be launched. If you put in 
a 1TB file, 8192 tasks would be launched. Tuning the block size is important to 
optimize the overhead of launching tasks vs. potentially under-utilizing a 
cluster. Typically, a cluster with a lot of data has a bigger block size.
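
As a concrete sketch of that arithmetic (the file and paths are made up;
assumes a configured Hadoop 2.x client, where dfs.blocksize is the
relevant property):

hdfs dfs -D dfs.blocksize=134217728 -put big.dat /data/big.dat
hdfs fsck /data/big.dat -files -blocks

For a 1GB big.dat, the fsck output should list 8 blocks (1024MB / 128MB),
matching the 8 map tasks a MapReduce job over that file would launch.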

The downsides of HDFS:
* Seeked reads are not supported afaik because no one needs that for batch 
processing
* Seeked writes into an existing file are not supported because either blocks 
would be added in the middle of a file and wouldn't be 128MB, or existing 
blocks would be edited, resulting in blocks larger than 128MB. Both of these 
scenarios are bad.

Since HDFS users typically do not need seeked reads or seeked writes, these 
downsides aren't really a big deal.

If something's not clear, let me know.


Thank you for taking the time to explain.

--Kerin