Hey Analytics!

I'm working on updating the Wikitech Analytics documentation
<https://wikitech.wikimedia.org/wiki/Analytics> based on my new
understanding of the Data Lake. I've already clarified that there's no
separate thing called the "Data Warehouse" (other than some experiments
from 2015), but I still don't understand the difference between the Analytics
Cluster <https://wikitech.wikimedia.org/wiki/Analytics/Cluster> and the Data
Lake <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>.

From what I learned yesterday, the Data Lake is everything stored in the
Hadoop cluster (including pageview, mediacounts, last-access, and edit
history data), even when it can't be usefully joined together.

But that seems to be the same thing as the Analytics Cluster ("the Hadoop
cluster and its related components"). Is it possible to pick one name
("Data Lake" or "Analytics Cluster") and stick with it? I promise you it'll
make the whole system much easier to understand for outsiders :)

-- 
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>,
product analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
