another idea i just had: you might want to design for taking snapshots of
your wikipedia data anyway, since it changes over time it would be nice to
have, say, an hourly snapshot stored. but 128GB is expensive to store
archivally, although S3 gets pretty cheap:

https://aws.amazon.com/s3/pricing/

looks like glacier is $0.004 per GB (USD) per month (the only difference
here from normal S3 is that retrieval times are in the range of minutes
rather than instant; i suspect that tier may still live on spinners, while
the hot S3 tiers are presumably mostly SSD these days, though that's
speculation on my part). that works out to 51.2 cents per month to store
one 128GB snapshot, so it can still get expensive: a day's worth of hourly
snapshots (24 of them) costs ~$12 per month to keep, and once a full month
of them has accumulated you're paying ~$369/mo, quite expensive imo, so
perhaps this won't work.
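
a quick sanity check of that arithmetic (prices from the pricing page
above; the 128GB snapshot size is your figure):

```python
# monthly cost of keeping hourly 128GB snapshots in S3 Glacier
GB_PER_SNAPSHOT = 128
GLACIER_USD_PER_GB_MONTH = 0.004

per_snapshot = GB_PER_SNAPSHOT * GLACIER_USD_PER_GB_MONTH  # ~$0.512/month each
one_day_of_hourlies = 24 * per_snapshot                    # ~$12.29/month
full_month_of_hourlies = 30 * one_day_of_hourlies          # ~$368.64/month

print(per_snapshot, one_day_of_hourlies, full_month_of_hourlies)
```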

there is another tier, glacier "deep archive", which i have not seen before
(the last time i looked at this stuff was maybe end of february this year;
AWS really does have a very rapid release cycle for new products/features).
it costs $0.00099 per GB per month, so about 12.7 cents per snapshot per
month, ~$3 per month per day's worth of hourly snapshots, so about $91 per
month for a full month of them, but that is also quite expensive.

of course, you could just do daily snapshots and cut the cost by a factor
of 24: roughly $15/mo to archive a month of daily snapshots in glacier and
~$3.80/mo in deep archive, i guess
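
same math for the daily-snapshot variant (30 snapshots resident after a
month; tier prices from the page above):

```python
# steady-state monthly bill for a month of daily 128GB snapshots, per tier
GB = 128
USD_PER_GB_MONTH = {"glacier": 0.004, "deep_archive": 0.00099}

monthly = {tier: 30 * GB * price for tier, price in USD_PER_GB_MONTH.items()}
print(monthly)  # glacier ~$15.36, deep_archive ~$3.80
```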

disclaimer: i do not work for an AWS sales team xD

i like storage and i/o problems, though
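
also, re: the GB/s-vs-Gb/s comparison in your message below, the unit
conversion checks out; here it is as a tiny sketch, using the figures from
your email and ignoring protocol overhead:

```python
# bytes-per-second vs bits-per-second: multiply by 8 to compare like with like
def gbytes_to_gbits(gb_per_s: float) -> float:
    return gb_per_s * 8

gpu_mem_gbps = gbytes_to_gbits(6.4)  # 51.2 Gb/s, per the quoted figure
roce_nic_gbps = 45.0                 # the RoCE NIC figure from the email
print(gpu_mem_gbps, roce_nic_gbps)   # same ballpark, as you said
```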

On Mon, Aug 17, 2020 at 1:16 PM ella <[email protected]> wrote:

> Stefan -
>
> 128GB of memory to hold everything? that is not so much; i think we may be
> able to work around the limitations of just dumping it all in memory,
> perhaps.
>
>   have you calculated i/o speeds based on memory specs? if so, please let
> me know which memory specs you are calculating with. in my testing,
> high-end SSDs were not too shabby for lots of random read/write tasks, but
> i don't know what the data you are getting from wikipedia looks like, and
> i'd have to confirm those results anyway since they were from testing i did
> about 5 years ago. i think we had an engineering sample of an nvme pci-e
> 16x card from micron, we were stocking intel enterprise SSDs via SATA 3.0,
> and we might have had an intel nvme, but they didn't take my opinion when
> it came time to order hardware.
>
> so i'm looking at clustering with rdma or something, since high-end NICs,
> too, can approach memory bus speeds or surpass them, i guess. although this
> might be a HUGE, fundamental misunderstanding on my part: memory speed on,
> say, video cards is measured in GB/s, but NICs are in Gb/s. so comparing
> 6.4GB/s for a latest nvidia card's memory bus against a 45Gb RoCE nic (may
> be prohibitively expensive, unsure), it's the same-ish, since 6.4GB/s
> (bytes into bits) is 51.2Gb/s. close, but anyway that is GPU land.
>
> plus, there would be huge performance hits when switching to a network
> stack due to memory addressing vs net addressing, unless maybe you could
> work directly with hw addressing somehow on the net stack? not sure what
> that might look like, and it really just brings us back to IP land. maybe,
> idk, i was never a network engineer so the OSI model is very faded at this
> point.
>
> i try to work with commodity hardware tho, since 1Gbps NICs are still far
> more common than 10Gbps NICs in consumer land in the USA, since pretty much
> no one's home internet is faster than 1Gbps anyway. i think even the local
> webhosting company hosts 20k-30k servers, with god-knows how many webapps,
> on like 4-5 redundant 10Gbps links
>
> OORRRR you could avoid most of that and find a software solution, maybe.
> here is an idea i had:
>
> maybe consider a caching mechanism on the server where, say, a disk image
> containing a local copy of wikipedia is mounted read-only, is updated by
> some other, separate process, and is presented as an immutable, read-only
> volume for your application (wikify)'s threads to consume according to some
> set of rules (rules will be necessary beyond an acl, in order to ensure
> data consistency on read). then you can let the OS handle memory caching if
> you want.
>
> you could also look at sharding the dataset across multiple disks for
> better threading, but that's a more expensive and unnecessary optimization
> rn anyway
>
>
> lmk if you have any thoughts!
>
>
>
>
>
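
and re: the immutable read-only volume idea in your message above: letting
the OS page cache do the work can be sketched with python's mmap. a minimal
stand-in (the "snapshot" here is just a throwaway temp file, not a real
wikipedia image):

```python
import mmap
import os
import tempfile

# write a tiny stand-in "snapshot" file; in the real setup this would be the
# separately-updated disk image mounted read-only
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello wikipedia snapshot")
    path = f.name

# map it read-only; repeated random reads are served from the OS page cache
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first = bytes(mm[:5])
    # the mapping is immutable: mm[0:1] = b"x" would raise TypeError

os.unlink(path)
print(first)  # b'hello'
```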

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T6322565b7d29a2a0-M5a0b89e7d3cb7c2ce806253c
Delivery options: https://agi.topicbox.com/groups/agi/subscription
