Wow! Thank you for the detailed notes :-) Also, I see that https://www.elastic.co/elasticon/videos has some videos, though I'm not sure how complete those are.
Cheers, Katie

On Tue, Feb 23, 2016 at 7:13 PM, Erik Bernhardson <[email protected]> wrote:

These are my notes, pretty much completely raw, from elasticon last week.

----------------------------------
Scaling 5 nodes -> 1,000
----------------------------------
significant terms query (expensive) may help with trending articles?
preferred log ingestion path: app -> kafka -> logstash -> elasticsearch
- es has features (node affinity) to have high power "hot" nodes and lower power / spinning disk "warm" or "cold" nodes and ensure indices end up on the right machines.
- **run master nodes on same physical hardware, separate JVM**
-- dedicate cores to master node
new node type: ingest
- specialized for indexing
- most common logstash filters implemented in library, used in elasticsearch
- can enrich data on indexing
- makes logstash not necessary for some workloads

10 nodes (128G memory, 32 threads, 1TB SSD) can handle 150k+ index operations per second for logging
**make clusters a first class citizen in your application**
**multi-tenant scaling -> multiple clusters**
new features: cluster state diffs -> allows more machines/indices/shards per cluster
- may help with our master timeout issues, but talked to some core devs and they weren't really sure. Preferred the idea to split into multiple independent clusters for the multi-tenant use case
- **more indices does matter for master complexity.
All about the total number of shards**
types are just a filter on a special _type column
new feature: columnar data store (doc values) on by default in 2.0
new software: beats
- is about data collection and shipping
- may take on a similar role as diamond, shipping to elasticsearch or logstash
new goal: "release bonanza" release multiple products on the same date to simplify compatibility between products
- es will go v2 -> v5, skipping in between, to match other product versioning
new tool for extracting graphs from elasticsearch data
- can find strongly connected nodes, etc.
- threat detection, possibly other interesting things.

----------
Lucene
----------

segment
- inverted index
- norms
- document store
- column store in v4
- k-d tree - faster structured search in v6
-- replaces inverted index in certain limited circumstances
-- old structured search is based on cells, optimizations are based around visiting as few cells as possible
-- cells can have different sizes, 2x2, 3x3, (always square?) etc.
-- k-d replacement - recursively splits the search space in two, aiming for an equal number of points on each side, until each fraction of the search space has one point in it.
-- works similar to traditional databases with binary search trees
-- elasticsearch will stop at 1000 docs per side rather than just 1
-- it's a memory trade-off, the tree needs to be kept in memory at all times
-- Mike McCandless is a primary author
-- Better in almost all aspects for spatial
-- Elasticsearch 2.2 - geopoint searches
-- for 1d numeric fields the merge process is much faster, segments are presorted so only the final stage of a merge sort is performed.

BM25 by default in ES v5.0
--
- this is massively oversimplified, please read the code and the commit logs
- common words are naturally discriminated
- no more need to exclude stop words.
-- doc freq contribution in bm25 goes all the way to 0, rather than 1.0 in tf-idf
- term freq contribution
-- bm25 saturates earlier. bounded at 2.2 by default

Faster queries - two phase iteration
- query can be divided into
-- a fast approximation
-- a (slower) match
- phrase
-- approximation = conjunction
-- match = position check
- Geo polygon query
-- approximation = points in cells that are in or cross the polygon
-- match = check the point against the polygon

Match cost API
1. approximate - (description:search AND description:engine) AND (body:postings AND body:list)
** 1. iterate body:postings (1k)
** 2. Check description:engine (15k)
** 3. check description:search (200k)
** 4. check body:list (370k)
2. match - description:"search engine" body:"postings list"

Other changes
* better query-time synonyms
- disk based norms
- more memory efficient doc values
- more disk efficient sparse doc values
-- lucene 5.4
- BooleanQuery simplification in rewrite
-- for instance imagine you are searching a matching docs query
-- filter (constant score query)
-- this is going to perform a conjunction (all docs that match both queries)
-- lucene 5.5
- bulk scorer specialization for MatchAllDocsQuery and MUST_NOT clauses
-- matching docs iterator must return sorted by score
-- re-balancing prioritizations was causing slowdowns
-- instead of doing it by document, now doing it in batches of 2048
-- a NOT b
- improve file truncation detection
-- leading cause of index corruption
-- checksums added in 4.8 notice the error
-- advances in 5.x add a nice error message

Also, SVN -> git


Scaling log ingestion at ????

10-12 nodes, old DB servers. lots of memory, disk, fast cpus

- number of indexes doesn't matter, it's about the number of shards
- stores 13 months worth of data

Initial config
- two jvms per server. One "hot" ingest instance pointed at fast disks.
one "warm" instance pointed at slower spinning disks for querying historical data
- 384G ram, 60TB of disk, 96 hardware threads, etc.

(initially) The bad:
- inefficient use of hardware
- limited number of nodes
- larger amounts of data to replicate on node failure

Thoughts:
- run all independent instances? Run many instances on one node?
- hardware overpowered for data nodes

(reworked cluster config) The good:
- Run elasticsearch instances completely independently. Separate yml config, separate data directories, etc.
- One hot ingest instance, 4+ warm instances
-- uses ~50% of memory for JVM
-- no matter how big your box is, keep jvm heap < 50%
-- at 5 to 7 instances on 48 core machines (96 with HT), running 5 instances does not need cpu affinity. Running 15 it does (but 15 is a bad idea)
- increase nodes
-- more resources for queries
-- better distribution of the larger number of shards
-- more shards supported, thus more indexes, thus more data available online
- faster replication on node failure due to smaller data sets

- increased peak query load possible by 10x (probably due to more jvm heap? didn't say)
- # of shards per instance affects scaling, query patterns
- requires creative allocation rules to keep primary and replica off of the same physical machine
-- quite a few. Requires many additional tags as well.
- get much more out of the hardware with smaller (~30G) heaps and more JVM instances


The Bad: What did it cost us?
- more complicated cluster management
-- need to address each instance for all management
-- need to set up proper allocation rules to protect data from a physical node crash
--- originally intended to define racks.
Allows separating primary/replicas by rack
--- reuse this to support many independent instances per machine
- Non-standard configuration means it is more difficult to get assistance from public resources and documentation
-- (paid) Support is your friend here.
- physical node loss means greater cluster impact
-- When a physical node goes down 40T+ of data needs to re-replicate
-- this takes about an hour in their network/disk configuration
- higher cost of hardware for data nodes
- more power and environment controls necessary

The End: well not really, we are still growing
- capable of processing 5.75 billion events a day
- peak is 2x of trough (as high as 10x)
- 24-30 physical data nodes
- 5 instances (1 hot, 4 warm), lots of storage, fast cpus, lots of memory
- resilient
-- can handle hardware failure and/or node crashes
-- can handle maintenance/upgrades
- small footprint in DC


Questions?

- tested JVM heaps down to 24G, was not enough for ES caching. Optimal is just under 32G in his instances
- main difference between hot and warm instances is how expensive the disk is
- running es 1.4.4, testing 2.2 but have not finished converting everything
- nodes are not separated on the boxes (VM, docker, etc), one kernel

Relevant config variables:
node.name - to identify the specific instance.
For consistency suggest naming the process, configuration files and node all the same
node.server - to identify the physical server
node.tag - optional - to be able to move shards around
path.conf - to identify the unique configuration directory
path.data
path.logs
path.work
cluster.routing.allocation.same_shard.host: true - enable check to prevent allocation of multiple instances of the same shard on a single host


--------------------------------------------------------------------
Stories from Support: Top Problems and Solutions
--------------------------------------------------------------------
"It Depends"

The Big Issues
1. Fielddata (always single word)
- most common problem.
- 60% of total heap can be taken up in 1.x and 2.x; more likely to be used in 1.x
- 100% of data nodes will be impacted
- 2.x (Mostly) Solves This - Thanks to doc_values being on by default
-- Wouldn't help for the wikimedia use case, we have very few aggregations that would use doc values. We also have relatively small fielddata though

Doc Values!
- Columnar store of values
- Written at index time to disk
-- Adds an extra file next to the segment on disk
-- takes more indexing time
-- Means you don't have to un-invert the inverted index at query time
-- Part of the reason JVM should not take more than 50% of memory
- Leverages the operating system's filesystem cache
- When upgrading from 1.x to 2.x you have to reindex to get the change
-- You cannot enable doc values after the fact.
****
- GET /_cat/fielddata?v is your friend
-- Couple hundred megabytes per node. Basically nothing.
- Mostly for aggregates (as in, not useful at wikimedia)
****
- analyzed strings do not currently support doc_values, which means that you must avoid using such fields for sorting, aggregating and scripting
- analyzed strings are generally tokenized into multiple terms, which means that there is an array of values
- with few exceptions (e.g., significant terms), aggregating against analyzed strings is not doing what you want
- unless you want the individual tokens, scripting is largely not useful
- Big improvements coming in ES 2.3 ("keyword" field)
-- Redo not-analyzed strings to allow them to use analyzers that produce single tokens
- use multi fields to create analyzed and un-analyzed fields

2. Cluster state
- every cluster state change is sent to every node
-- requires a lot of short lived, potentially large network messages
-- gets worse with more nodes or indices
-- mappings tend to be the largest portion
- GET /_cluster/state?pretty
- Not stored in memory as JSON, so this is just to give the idea (it's likely 5% of it, ballpark)
-- elastic1001 reports JSON as 25MB. 5% would be 1.25MB (tiny)
-- a Gig would be devastating for any cluster size
-- 100M for a 100 node cluster would be "reasonable"
-- 100M for a 10 node cluster would be odd
-- Worst seen (by presenter) was 4G

ES 2.0 introduces cluster state diffs between nodes
-- changes become far more manageable and a large cluster state is no longer as problematic
-- reducing your mapping size helps too
-- Do not allow dynamic mappings in production
- Do not use _types to separate data
-- duplicating mapping if they are the same
-- if they are different unnecessarily something
-- prefer to create separate indices
- Create a "type" field to do this for you
- Prefer changes in bulk rather than one-by-one (allow changes to be batched)
** Create 900 new titlesuggest indices in one cluster operation?**
-- there is no recommended size.
Maybe 5MB payload size? Play around with it


3. Recovery
- Changed dramatically in ES since 1.4
- restarting a node or otherwise needing to replicate shards
- terribly slow process
- segment by segment
- minor risk of corruption pre-es 1.5 with a sketchy network
-- if 90% done and the network drops there could be issues
-- don't use 1.5

1.6 -> asynchronous allocation and synced flushing
1.7 -> delayed allocation and prioritized allocation
2.0 -> Cancelled allocation
2.1 -> prioritized allocation for replicas

If you are not on 1.7.5 get there SOON.


4. Node Sizing

- elasticsearch is a parallel processing machine
- java can be a slow garbage collecting calculator
- slow disks. The problem for every data store?
-- Do not use SAN. Really. Don't do it
- A few huge instances isn't best, keep JVMs <= 30G

And how long is a piece of string?

Memory
- 50% of system ram to heap
- Up to 30500M - no more or your heap loses optimizations!
-- compressed pointers

CPU
- indexing tends to be CPU bound
- At least 2 cores per instance
-- more you give it the faster it can go (duh)

IO
- disks get hammered for other reasons, including write-impacting
****
- translog in 2.0 fsyncs for every index operation
-- major major change in 2.0
****
- SSDs or Flash are always welcome

5. Sharding
- ES 1.x defaults to 5 primary, 1 replica
- ES 2.x defaults to 2 primary, 1 replica
**
- 50GB is MAX shard size. More for recovery than performance reasons
**
- Increase primaries for higher write throughput and to spread load
- Shard because you need write capacity.
- If you need better search response (latency) increase number of shards
- Replicas are not backups.
Rarely see a benefit with more than 1
-- They are for HA, not backups
-- They do not save you if "everything goes to hell"
-- Snapshots are your friends
--- Incremental, next snapshot gets only segments that are "new"
-- Oversharding is becoming an issue
--- 41k shards in a 10 node cluster is extreme

Bad Queries
- Deep pagination
-- ES 2.0 has a soft limit of 10k hits per request. Linearly more expensive per shard
-- Use scan and/or scroll API
- Leading wildcards
-- Equivalent to a full table scan (BAD!!)
- Scripting
-- Without parameters
-- dynamically (inline)

Aggregations
- Search is always faster than aggregations
- Don't aggregate when you want search

Merge throttling
- Disable it on SSDs

Use bulk processing
- Indexing 1 by 1 you will have a bad day (if trying to index many things)

-------------------------------
Securing Elasticsearch
-------------------------------

Pitfalls securing search - or why REST based security solutions rarely work

Why would you?
- Most of the clients are using the HTTP protocol
- Easy to implement using a standard HTTP reverse proxy
- Many elasticsearch operations are REST compliant
-- PUT /index/type/doc_id - indexes a document
-- GET /index/type/doc_id - gets a document
- A reverse proxy isn't enough, you also need firewalls
-- (i don't think that's a big deal, at all. Use SSL)
-- must establish a security perimeter

What does it get you?
- Works great for authentication. Access/No access to everything
- Fails for authorization
-- Role based access to certain indices, certain parts of indices, certain fields

Limiting access to an index by using URL matching
- Theoretically easy! Just set up the following URL matching patterns:
-- GET /myindex/*/*
-- PUT /myindex/*/*
-- POST /myindex/*/*

- But what about?
- POST /myindex/type/_bulk
- /myindex1,myindex2/*/*
- Could disable multi-index rest actions
-- rest.action.multi.allow_multi.??? = none

Search is not always limited to the index in the url
Cross index search operations:
- More like this query
- Terms lookup mechanism in the terms query
- percolation
- you end up needing to whitelist/blacklist query patterns

Be really careful with URL matching rules!
- Documented exceptions on the elasticsearch website

Document level security
- A typical solution for document level access control
-- Add an ACL token(s) to each document
-- Create an alias with ACL filters for each user
-- Limit users' access to their aliases
- Works *most* of the time
-- Aliases are not considered in
--- Suggest API
--- Children Aggs
If you are trying to use aliases as a security solution, augment it WRT query types

Filtered aliases are not a security mechanism
- More like working together with other systems

Field level security
- A naive solution - filtering of source in response
-- /_search?_source_exclude=secret_part.*
-- This is not enough. If you can search it you have access to it. There is no exception to this.
A field in the source can appear in multiple ways
-- multi match, aggregations, etc. etc.
-- Requires basically whitelisting queries

Securing search is a difficult problem
- URL path is not a filter
- Some operations work on the entire index
- It's a search and analytics engine, which makes frequency analysis attacks so much easier
-- Lots of nice statistics like term frequency, etc.

Is there a solution?
- Move all security to the business logic layer.
Deny raw access
- Move security to the transport layer
-- This is what shield (now x-pack) does
-- They believe that securing at the lowest level is the only way to secure elasticsearch
- Hard to know from the REST api what will happen with a given request
- Authorization is secured at the individual TransportAction level
-- Prevents the user from accessing a terms lookup (for example) on an index they don't have access to
- TLS encryption for inter-node transport and http
- Message signing means even without TLS a node cannot join the cluster
- Secured down to the lucene level - integrating with how elasticsearch reads the index
- In a request with document level security the query defined on a role will get added to the index reader. Documents that aren't allowed look deleted to the IndexReader
- Field level security is similar. At the lucene level fields look like they don't exist


------------------------------------------------------------------------------------------------
Boosting Search Accuracy with Engine Scoring and Predictive Analytics
------------------------------------------------------------------------------------------------
Paul Nelson
Chief Architect, Search Technologies

(note to self: This will probably be advertising to hire them as consultants)
(talked to paul after the talk, he had previously talked with nik and is interested in working with us. I think what he really wants is to use our click through data to improve his own business case though. Tomasz will be in contact and see if we can work together).

- Speaker has been involved in this since the mid 90s
-- NIST
- Many of these ideas came from TREC, updated with big data and logs

- Common customer complaint
-- Q: What's wrong?
-- A: Our search is bad
-- Q: How bad?
-- A: Bad
-- Q: Scale 1-10?
-- A: 8? 9?
Let's call it 9.23
-- (completely bogus, means nothing without statistically based numbers)

- Golden query set
-- Key documents
-- Just because it's called golden doesn't mean it's any good
- Top 100 / Top 1000 queries analysis
-- May not represent actual usage of your system
-- Taking a random selection can be more useful. For example, for one customer 40% of queries were human names. Large variance didn't show up in the top 1000 queries.
-- The long tail could be 85%+ of queries
- Zero result queries
- Abandonment rate
- Queries with clicks
- Conversion

These all have a problem: You need to put your search engine into production to compute them, which hurts your bottom line.
- (if you are ecommerce or advertising)

What are we trying to achieve?
* Reliable metrics for search accuracy
-- repeatable
* Can run analysis off-line
-- Does not require production deployment (!)
* Can accurately compare two engines
-- Different technologies
-- Different versions of the same engine
-- Different scoring on the same engine
* Runs quickly = agility = high quality
-- Iterating the engine quickly is key
* Can handle different user types / personalization
-- Broad coverage
-- Modify relevancy scoring based on source website, "cluster" the user is in
* Provides lots of data to analyze what's going on
-- Data to decide how best to improve the engine

Leverage logs for accuracy testing
Query logs, Click Logs -> Engine scoring framework -> Search engine under evaluation
* engine scores - have one (or two, or three) top level numbers that give your engine a score for the query set
* other metrics and histograms - generate useful metrics, such as a histogram of where recorded click-throughs were in the final positions
* scoring database - Record historical scores, and what changes led to those scores. Re-review prior changes after making new changes. They might not be as useful any more!
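The log-driven scoring framework described above can be sketched in a few lines. This is a minimal, hypothetical harness (all names invented here, and MRR stands in for whatever top-level formula the framework actually uses): replay logged clicks against the engine under test and produce both a single engine score and a histogram of click positions.

```python
from collections import Counter

def engine_score(click_log, search_fn, max_rank=10):
    """click_log: list of (query, clicked_doc_id) pairs from production logs.
    search_fn: query -> ranked list of doc ids (the engine under evaluation).
    Returns (mean reciprocal rank, histogram of where clicks landed)."""
    histogram = Counter()
    reciprocal_ranks = []
    for query, clicked in click_log:
        results = search_fn(query)[:max_rank]
        if clicked in results:
            pos = results.index(clicked) + 1   # 1-based rank
            histogram[pos] += 1
            reciprocal_ranks.append(1.0 / pos)
        else:
            histogram["miss"] += 1             # clicked doc no longer in top N
            reciprocal_ranks.append(0.0)
    score = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
    return score, histogram

# Toy engine: static rankings per query, standing in for a real search call.
fake_engine = {"apple": ["d1", "d2", "d3"], "pear": ["d9", "d4"]}
log = [("apple", "d2"), ("apple", "d1"), ("pear", "d7")]
score, hist = engine_score(log, lambda q: fake_engine.get(q, []))
```

Rerunning this after every relevance tweak, and recording the score in a "scoring database", gives exactly the repeatable offline comparison the talk argues for.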
From Queries -> Users
* User by User metrics
-- change in focus
-- Queries don't matter. People matter. It does *not* matter if a single query returned the right answer. It matters if the user got their right answer (maybe two queries? Not the end of the world)
* Group activity by session and/or user
-- call this an "activity set"
-- Merge sessions and users
* Use Big Data to analyze *ALL* users
-- There are no stupid queries and no stupid users
-- Overall performance based on the experience of the users

Engine Score
- Group activity by session and/or user (Queries and Clicks)
-- Trying to create a model for the user. Everything the user has found useful
- Determine "relevant" documents
-- Lots of people don't have user ids, only sessions
-- What did the user view? Add to cart? Purchase?
-- Did the search engine return what the user ultimately wanted?

* Determine engine score per user per query
-- Σ power(FACTOR, position) * isRelevant[user, searchResult[Q,position].DocID]
-- Evaluated from each user's point of view
-- (Note: Many other formulas possible, MRR, MAP, DCG, etc.)
-- Speaker happens to like this one. Some algos prefer top results, some prefer results "on page"
* Average score for all of a user's queries = user score
* Average scores across all users = final engine score
-- Important to average per user (or per session) and not per query
-- Prevents a single user with 99 queries from changing the score

Off-Line Engine Analysis
- searchResult[Q,position].DocID
-- Done by re-executing the users' queries offline

Continuous improvement cycle
Modify engine -> execute queries -> compute engine score -> evaluate results -> back to beginning
* Kind of like sports, you can have a personal best engine score
* Scoring never stops, generating one score is not enough. Generate scores and track over time

What else can we do with engine scoring?
- predictive analytics
- The brutal truth about search engine scores
-- They are not based on science.
-- Random ad-hoc formulas put together
- BM25 invented in 1980. Not state of the art. Predictive analytics is

Use Big Data to predict relevancy
Document Signals, Comparison Signals, Query Signals, User Signals

The score predicts probability of relevancy
Value from 0 -> 1
* Can be used for threshold processing
* All documents too weak? Try something else!
* Can combine results from different sources / constructions together
- Identifies what's important
-- Machine learning optimizes the parameters
-- identifies the impact and contribution of every parameter
- If a parameter does not improve relevancy -> REMOVE IT!!!

Come out of the darkness
- Ultimate message is use science


DEMO
- Not time based, just a list of queries and clicks
- Don't have to tie clicks to users

Q/A

- Single relevancy score not as useful - Build a relevancy lab to have multi-dimensional scores

- tf/idf may be a signal, but the resulting formula usually replaces tf/idf completely
-- tf/idf only used to surface likely documents

-------------------------
NY Times Search
-------------------------

explicit search, formal user queries, used but not the main use case
implicit search powers much of nytimes.com

Ingestion
---------
Few updates - A few thousand updates per day
Sources - CMS, Legacy systems, file based archives, OCR
Latency - <1s, Low latency is important. Especially for implicit search.

Content typically doesn't show up on the website until it's indexed

Typical Use Cases
-----------------
- not comprehensive
- People search for something they read. They read an article and they want to read it again
-- Only 1 correct result.
-- Goal (not accomplished): recognize these queries and give a single, correct, result
- Find something you wrote
-- Less common, but happens surprisingly often
- Find reviews for movies/books
-- Typically has a single correct result
- Find recipes
-- Similar, but more than a single correct result
- "Manual sharing"
-- Instead of sending a link to someone you tell them you read an article on xyz and they search for it
- "Casual" news search
-- Looking for information on recent events
- Serious research
-- More interesting
-- Makes up a big amount of queries on the website
-- Typically journalists or historians

Why not just use google?
1. Keep the customers on site
-- Google risks the customer going to a different site
2. There is no google for native apps
-- Native apps need to provide good search
-- Because they have multiple apps that do slightly different things they need special cases
3. We know our content better
-- Most important
-- Have editorial teams, taxonomy teams.
They know which article is relevant for a specific case
-- Elasticsearch has allowed them to tune this exactly how they want

Search should be targeted at their most advanced readers
- A movie review can be found on google
- If you want to know the 10 most important articles on syria, NY Times can do that much better
-- Not there yet, lots of work still to do, but that is the direction they are headed

Our current Setup
-----------------
- The search stack runs entirely on AWS
-- Case for the NYTimes website
- Deployed using an in-house infrastructure tool
- Nagios and New Relic for monitoring
- Sumo Logic for log management and analytics

System Setup
- Two full production clusters
-- Not just for elasticsearch, but the whole stack (ingestion pipeline, etc)
- DNS based failover
- Also used for maintenance/major changes/reindexing
- 16 node ES clusters in production (m1.xlarge)

CMS->SQS->Handler->Normalization->Merge->Indexer->Elasticsearch->Search API<->Semantic platform
- Many intermediate layers read/write data from S3
- Beanstalk handles queueing (i think?). Replacing with kafka


Normalization
- Large number of rules defining how to extract fields from different data sources
- Constructs a canonical document which is the same for all data sources
- Canonical document pushed into ES

Merge
- Ensures that only the latest version is indexed

Indexer
- Pushes documents into elasticsearch and mongodb

Search API
- powered by elasticsearch and mongodb
- For each search result looks up the full document from mongodb and returns that to the client

Two clusters make maintenance work easy.
- The way the handler works makes it relatively easy to handle the disparate formats

Number of issues with this setup:
- Too many moving parts
- Too many things that can go wrong
- Difficult with this setup to index new fields
-- New fields are added quite frequently
-- Have to go back to normalization, add the new fields, reindex everything
-- Many times requires going all the way back to the CMS
-- A simple task that shouldn't be this hard
- Despite the fact they have two clusters they don't know if the offline cluster is still working
-- Cache goes cold
-- Have to warm up the cache before putting it online
-- Offline cluster doesn't receive production traffic
-- Only discover failure when they try to fail over
- Hard to run a complex setup like this on a laptop
-- Want to run the entire pipeline locally
- Overall this is a setup that is so complicated that they spend a good amount of their time doing operations work instead of search
-- Seen this issue in many places

Future Work - Simplify
1. Full document in elasticsearch
- store and index full documents in elasticsearch
-- Not done previously for historical reasons of support for document sizes (alt search engine?)
- No external lookup necessary in the API, no MongoDB
- Demand driven API - clients decide what they want.
- Not in production yet, but some tests show this is better

2. Replay
- Use kafka
- All content will go through kafka
- Kafka will persist all content, source-of-truth for search
- Easy to create new search clusters through Kafka replay
- Goal: Never upgrade a cluster. Boot a new cluster, replay kafka to populate, switch over and kill the old cluster

3.
Keep all clusters busy
- All production traffic will be replayed from the active production cluster to standby production and to staging
- Uses gor
- Makes sure that the standby cache is always warm
- Changes to staging will be exposed to production traffic
- If the standby cluster crashes they will know

4. Virtualize
- Vagrant boxes for everything
- Make it easy to run the full pipeline locally.
- Provision vagrant boxes using the same system as production servers
-- *exactly* the same. No differences between local and prod per-box configuration

CMS->Kafka->logstash->elasticsearch->Search API
- No S3. No beanstalk. Much simpler.

Conclusion and comments
1. Figure out how you can be better than Google
-- If you don't think you can be better let google do it
-- Know your users
-- Search should be an integrated part of the product
--- Search should not be an afterthought
--- Plan it from the beginning
-- Be consistent across devices and platforms
-- Focus on where you can make a difference for the end user

2. Infrastructure is hard
Think about how to
- do deployments
-- Far too often people set up search clusters without planning how to deploy new versions
-- Know how you can easily upgrade
- do upgrades
- reindex content
- change the way you normalize data
- create new search clusters
... and make it easy.

Things to do
- Put your infrastructure in version control
-- Nothing should be done from the command line to set up infrastructure
-- When everything crashes, and some day it will, it needs to be easy
- Vagrant everything
- Have immutable servers, replace instead of changing
-- Only really possible on cloud; raw hardware requires a different approach
- Make deploying a new stack easy and automatic

Is elasticsearch for text search?
- Background for this question is they attended a one day elastic conference in new york
-- None of the talks were about text search
-- This week is very similar
-- Looked over new features in ES 2.0. Categorized into analytics/search/other stuff (mostly cluster)
-- Found 6 issues out of 119 dealing with search. 5%.
-- Skeptical that elastic wants to focus on text search as an important use case
-- Those features are not revolutionary in any way, just regular tweaks

New things necessary in elasticsearch for text search:
- Query log handling
-- Need to know what users are searching for and what they are doing
-- Completely missing from elasticsearch
-- Considers it a big problem
-- Elasticsearch has plenty of ways to analyze logs. But nothing that logs and ingests its own search history
- Easier relevancy tuning
-- Difficult to change things. Getting the right score for one document without changing all other searches
-- Needs tooling. Providing a set of benchmark queries
-- When you change things for one query, how does that affect other queries?
- Improved linguistics
-- No built in language detection
-- No morphologies - lemmatization
-- No compounding
-- No entity extraction
- Really wants to see some leadership from elastic in the text search space

Q/A:
Optimizations around analyzers?
- Not much. Most is done pre-indexing by adding keywords and custom scoring
Primary data store?
- Source of truth is a custom CMS
Moving away from AWS?
- not trying to, just want to make sure it's possible (using kafka instead of SQS, etc.)
Follow up on latency, time from publish to shown on site?
> - Less than 1s
>
>
> -------------------------------------------------------
> Geospatial data structures in elasticsearch and lucene
> -------------------------------------------------------
>
> Geo field types
> - geo point
> -- used to have 8 or 9 parameters, as of ES 2.2 only 3 parameters
> necessary
> -- others still there, but deprecated and being removed
> -- type, geohash_precision, geohash_prefix, ignore_malformed all that is
> left
> -- only type, ignore_malformed is being kept
> - geo shapes
> -- used to have 12 parameters, many expert only
> -- going forward: type, ignore_malformed, tree, precision,
> distance_error_pct, orientation, ???_only
> -- orientation going away; the standards define right-hand rule
> -- ES 5.0 ISO compliant, will reject non-compliant shapes
> -- points-only mode allows short-circuiting the quad tree logic,
> optimizes for point-only types
> --- points-only vs points index is 10-15% better performance
> --- only geo shape, no points only
> OGC simple feature access
> ISO Geographic information - spatial schema 19107:2003
>
> Geo indexing
> GeoPointField change in lucene 5.4
> - create a quad tree encoding: first bit is world-level latitude, next
> bit world-level longitude
> -- rotates back and forth every two bits
> -- Space filling curve, dimensionality reduction
> -- Their implementation is a Z-curve (Morton curve). Nearest neighbors
> are not necessarily so close.
> -- Investigating Hilbert curve. Current is 4 operations to encode,
> Hilbert takes longer. Tradeoff of accuracy/locality vs performance
> -- precision step parameter affects size of index. More precision ->
> more terms
> -- "optimal" precision step for smallest index possible. Currently 9
>
> 2.1 -> 2.3 large change in performance of geo. ~30% index size, 12%
> heap, 40% time (indexing or query?)
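(The interleaved lat/lon encoding in the notes above is easy to sketch. This is a toy illustration, not Lucene's actual GeoPointField code; the 16-bits-per-axis quantization is an arbitrary choice for the example.)

```python
def quantize(value, lo, hi, bits=16):
    """Map a coordinate into an integer grid cell of 2**bits cells."""
    cell = int((value - lo) / (hi - lo) * ((1 << bits) - 1))
    return max(0, min((1 << bits) - 1, cell))

def morton_encode(lat, lon, bits=16):
    """Z-order / Morton encoding: interleave lat and lon bits.

    Latitude goes in the odd (higher) bit positions so the leading bit
    of the code is the world-level latitude split, the next bit the
    world-level longitude split, alternating from there, as described
    in the talk.
    """
    y = quantize(lat, -90.0, 90.0, bits)
    x = quantize(lon, -180.0, 180.0, bits)
    code = 0
    for i in range(bits):
        code |= ((y >> i) & 1) << (2 * i + 1)  # latitude bit
        code |= ((x >> i) & 1) << (2 * i)      # longitude bit
    return code
```

Sorting points by this single integer gives the space-filling-curve dimensionality reduction mentioned above; the Z-curve's drawback (neighbors on the curve are not always geographic neighbors) is why they are investigating Hilbert curves.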
> 5.0 -> geo shapes and geo points merged as geo @experimental
> - balanced k-d trees
>
> Geo search
>
> Geo aggregations
>
> ---------
> BM25
> ---------
>
> A free text search is a very inaccurate description of our information
> need
> What you want
> - quick learner
> - works hard
> - reliable
> - enduring
> - ...
>
> But you type: "hard-working, self-motivated, masochist"
>
> The purpose of this talk
> - know the monster, understand what the parameters of BM25 do
> - know why it has the label "probabilistic"
> - mathematical concepts behind it: more for your entertainment
> - be convinced that switching to BM25 is the right thing to do
> -- hardest part of this talk
> - be able to impress people with your in-depth knowledge of
> probabilistic scoring
>
> The current default - TF/IDF
> - search in self-description of applications for these words
> - we want to order applications by their relevance to the query
>
> Evidence for relevance - term frequencies
> - Use term frequencies in description, title, etc.
>
> Major tweaks:
> - term freq: more is better
> -- the more often it occurs, the more relevant this document is
> -- square root of the term frequency, summed up
> -- increases non-linearly
> - inverse document frequency
> -- common words are less important
> -- a, is, the, etc.
> -- also non-linear: multiply tf with idf and sum that
> - long documents with same tf are less important: norm
> -- also non-linear; idf is more L-shaped than this though
>
> Bool query and the coord factor
> - doc a: holiday: 4, china: 5
> - doc b: holiday: 0, china: 15
> - Coord factor: reward document a with both terms matched
> -- basically a hack
>
> TF/IDF
> - successful since the beginning of lucene, since first release
> - well studied
> - easy to understand
> - one size fits most
>
> So... can we do better?
> - It is somewhat a guess
> - sum square root of term frequencies, which was a guess of an original
> developer
> -- ad hoc
>
> Probabilistic ranking and how it led to BM25
> The root of BM25: probability ranking principle (abridged)
> - "if retrieved documents are ordered by decreasing probability of
> relevance on the data available, then the system's effectiveness is the
> best that can be obtained for the data."
> - Chance to deploy prior knowledge of relevance and scoring to make
> results better
> - if mapped to a mathematical framework, can take advantage of
> mathematicians' past proofs
>
> Getting into math:
> 1. Don't care how much heap, how many CPUs. Get to that later, forget
> about it now
> 2. Think of yourself as super smart, beyond 2 standard deviations
>
> Estimate relevancy:
> - simplification: relevance is binary!
> - get a dataset of queries and relevant / irrelevant documents
> -- use that to estimate relevancy
> - how to use this to actually get a better scoring?
> - P(R=1|d,q)
> -- P(A|B) = probability of A given B, R = relevancy (1/0), d = document
> - For each document, query pair - what is the probability that the
> document is relevant
> -- Need some sort of matrix, document vs query
> ...here be math...
> - paper: The Probabilistic Relevance Framework: BM25 and Beyond
> - Stephen Robertson and Hugo Zaragoza
> ..and we get to..
> - W(d) = sum log( P(..)P(..) / P(..)P(..) )
>
> BM25: how to estimate all these probabilities
> - two assumptions
> -- the binary independence model - a dramatic but useful simplification
> -- a query term occurs in a document or doesn't; we don't care how often
>
> - Robertson/Sparck Jones weight
> -- another equation
> -- assumes unlimited supply of interns
> - simplified
> -- still use the Robertson/Sparck Jones weight but assume that the
> number of relevant documents is negligible (R=0, r=0)
>
> - BM25 has an S shape, goes to zero.
> tf/idf is L-shaped
>
> - In TF/IDF the more often the term occurs the better
> - but, is a document about a term just because it occurs a certain
> number of times?
> - this property is called eliteness
> -- in the literature
>
> Example for "eliteness"
> - "tourism"
> - Look at wikipedia: Many documents are about tourism
> - Many documents contain the word tourism - but are about something
> completely different, like for example just a country
>
> Can we use prior knowledge on the distribution to ...
>
> Eliteness as Poisson distribution
> Two cases:
> - document is not about the term
> - document is about the term
>
> How to estimate this?
> - gather data on eliteness for a term
> - many term frequencies -> do for many documents
>
> How relevance ties into that
> - suppose we knew the relationship of ???
>
> ....
>
> Tf saturation curve
> - limits influence of tf
> - allows tuning the influence by tweaking k
> - major difference from tf/idf to bm25
>
> So... we assume all documents have the same length?
> - Poisson distribution assumes a fixed length of documents
> - but they don't have that (most of the time)
> - we have to incorporate this too!
> - scale tf by it like so: (long equation)
> -- includes interpolation between 1 and document length over average
> document length
>
> Influence of b: length weighting
> - tweak influence of document length
> - similar to tf/idf, depending on b value
>
> Is BM25 probabilistic?
> - many approximations
> - really hard to get the probabilities right even with unlimited data
> -- mathematicians have been trying to solve this for years, no success
> - BM25 is "inspired" by probabilistic ranking
>
>
> -- 1992 TREC-2 took a "leap of faith" (?)
> -- 1993 TREC-3 BM25 final!
> -- 1999 First lucene release (tf/idf)
> -- 2011 Pluggable similarities + BM25 in lucene
>
> So... will I get better scoring with BM25?
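(The tf saturation curve in the notes above is the clearest practical difference, and it is easy to see numerically. A sketch, not Lucene's implementation: classic Lucene's sqrt(tf) contribution versus the BM25 tf term, using the commonly cited defaults k1=1.2, b=0.75.)

```python
import math

def classic_tf(tf):
    """Classic Lucene tf contribution: grows without bound."""
    return math.sqrt(tf)

def bm25_tf(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    """BM25 tf term with length normalization.

    b interpolates between no length penalty (b=0) and full scaling by
    document length over average document length (b=1).
    """
    norm = k1 * (1 - b + b * dl / avgdl)
    return tf * (k1 + 1) / (tf + norm)

for tf in (1, 5, 25, 125):
    print(tf, round(classic_tf(tf), 2), round(bm25_tf(tf), 2))
```

With dl equal to avgdl, the BM25 contribution saturates just under k1 + 1 = 2.2 no matter how often the term occurs, while sqrt(tf) keeps growing: the S shape versus the L shape described above, and why stop words stop mattering.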
> Pros with the frequency cutoff:
> - TF/IDF: common words can still influence the score
> - BM25: limits influence of term frequency
> -- less influence of common terms
> -- no more coord factor!
> -- check if you should disable coord for bool queries?
> --- index.similarity.default.type: BM25
>
> Other benefits:
> - parameters can be tweaked. Not sure if many people actually need to.
> You need data to measure if scoring is getting better or not. Do not
> tweak without data
> To update:
> - close index
> - update mapping (or settings)
> - re-open index
>
> Mathematical framework to include non-textual features
> - go read the paper
>
> A warning: lower automatic boost for short fields
> - With TF/IDF: short fields (title, ...) are automatically scored higher
> - BM25: scales field length with the average
> -- field length treatment does not automatically boost short fields (you
> have to explicitly boost)
>
> Is BM25 better?
> - literature suggests so
> - challenges suggest so (TREC, ...)
> - Users say so
> - Lucene developers say so
> - Konrad Beiske says so: "BM25 vs Lucene default similarity"
> - But: it depends on the features of your corpus.
> - Finally: you can try it out now! Lucene stores everything necessary
> already
>
> Useful literature:
> - Manning et al., Introduction to Information Retrieval
> - Robertson and Zaragoza, The Probabilistic Relevance Framework: BM25
> and Beyond
> - Robertson et al., Okapi at TREC-3
>
> ----------------------------
> Quantitative Cluster Sizing
> ----------------------------
>
> 1. Understanding why "it depends"
> Factors
> - Size of shards
> - Number of shards on each node
> - Size of each document
> - Mapping configuration
> -- which fields are searchable
> -- automatic multi-fields
> -- whether messages and _all are enabled
> - Backing server capacity (SSD vs HDD, CPU, etc.)
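(A back-of-envelope sketch of how the factors above combine. Every number here is a hypothetical placeholder; the talk's whole point is that bytes-per-doc and the maximum acceptable shard size must come out of the experiments it describes, not a formula.)

```python
def estimate(doc_count, bytes_per_doc_on_disk, replicas, max_shard_gb):
    """Estimate total disk and a primary shard count from measured inputs.

    bytes_per_doc_on_disk comes from the disk-utilization experiment;
    max_shard_gb from the shard-breaking-point experiment.
    """
    primary_bytes = doc_count * bytes_per_doc_on_disk
    total_bytes = primary_bytes * (1 + replicas)          # replicas copy everything
    shard_cap = max_shard_gb * 1024**3
    shards = max(1, -(-primary_bytes // shard_cap))       # ceiling division
    return total_bytes, shards

# e.g. 500M docs at a measured ~600 bytes each on disk, 1 replica,
# shards capped at 30GB (all illustrative numbers)
total, shards = estimate(500_000_000, 600, 1, 30)
```

This only answers "how much disk" and "how many shards"; nodes-per-shard and sustained indexing rate still have to come from the saturation and small-cluster benchmarks below.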
> Your organization's requirements / SLAs
> - retention period of data
> - ratio and quantity of indexing vs search
> - nature of use case
> - continuous vs bulk indexing
> - kinds of queries being executed
> - desired response time for queries that are run frequently vs
> occasionally
> - required sustained vs peak indexing rate
> - budget and failure tolerance
>
> Let's try to determine
> - How much disk storage will N documents require
> - When is a single shard too big for my requirements
> -- When will that shard reach a point where search queries will no
> longer be within SLA
> - How many active shards saturate my particular hardware
> -- Keep putting shards on a node? Back it down
> - How many shards/nodes will I need to sustain X index rate and Y
> search response
>
> 2. Sizing methodology
> - run 4 experiments
> - 1. Determine various disk utilization
> -- Use a single node cluster with one index
> -- 1 primary, 0 replicas
> -- Index a decent amount of data (1GB or about 10 million docs)
> -- Calculate storage on disk
> --- Mostly for logging use case, before and after _forcemerge
> -- Repeat the above calculations with different mapping configurations
> --- _all both enabled and disabled
> --- settings for each field
> - 2. Determine breaking point of a shard
> -- Use a single node cluster with one index: 1 primary, 0 replicas
> -- Index realistic data and use realistic queries
> -- Plot index speed and query response time
> --- Build some dashboards. Latency increases with shard size
> -- Determine where the point of diminishing returns is for your
> requirements
> - 3. Determine saturation point of a node
> -- Use a single node cluster with one index
> --- 2 primaries, 0 replicas
> --- repeat experiment two to see how performance varies
> --- keep adding more shards to see when the point of diminishing
> returns occurs
> - 4. Test desired configuration on a small cluster
> -- Configure a small representative cluster
> -- Add representative data volume
> -- Run realistic benchmarks:
> --- Max indexing rate
> --- Querying across varying data volumes
> --- Benchmark concurrent querying and indexing at various levels
> -- Measure resource usage, overall docs, disk usage, etc.
>
> 3. Scenario and experiment results
> - Practical sizing example
> -- show some patterns, draw some conclusions about how it works
> -- no numbers; all your numbers will be different anyway
> - Graph latency, record count, shard size
> -- Latency increases proportionally with record count
> - ... lots of stuff about the log ingestion use case ...
>
> 4. Interpreting results and expanding to other scenarios
>
> ...
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
> -- @wikidata
