Wow! Thank you for the detailed notes :-) Also, I see that https://www.elastic.co/elasticon/videos has some videos, though I'm not sure how complete those are.
Cheers, Katie

On Tue, Feb 23, 2016 at 7:13 PM, Erik Bernhardson <[email protected]> wrote:

These are my notes, pretty much completely raw, from elasticon last week.

----------------------------------
Scaling 5 nodes -> 1,000
----------------------------------
significant terms query (expensive) may help with trending articles?
preferred log ingestion path: app -> kafka -> logstash -> elasticsearch
- es has features (node affinity) to have high power "hot" nodes and lower power / spinning disk "warm" or "cold" nodes and ensure indices end up on the right machines.
- **run master nodes on same physical hardware, separate JVM**
-- dedicate cores to master node
new node type: ingest
- specialized for indexing
- most common logstash filters implemented in library, used in elasticsearch
- can enrich data on indexing
- makes logstash not necessary for some workloads

10 nodes (128G memory, 32 threads, 1TB SSD) can handle 150k+ index operations per second for logging
**make clusters a first class citizen in your application**
**multi-tenant scaling -> multiple clusters**
new features: cluster state diffs -> allows more machines/indices/shards per cluster
- may help with our master timeout issues, but talked to some core devs and they weren't really sure. Preferred the idea to split into multiple independent clusters for the multi-tenant use case
- **more indices does matter for master complexity.
All about the total number of shards**
types are just a filter on a special _type column
new feature: columnar data store (doc values) on by default in 2.0
new software: beats
- is about data collection and shipping
- may take on a similar role as diamond, shipping to elasticsearch or logstash
new goal: "release bonanza" release multiple products on the same date to simplify compatibility between products
- es will go v2 -> v5, skipping in between, to match other product versioning
new tool for extracting graphs from elasticsearch data
- can find strongly connected nodes, etc.
- threat detection, possibly other interesting things.

----------
Lucene
----------

segment
- inverted index
- norms
- document store
- column store in v4
- k-d tree - faster structured search in v6
-- replaces inverted index in certain limited circumstances
-- old structured search is based on cells, optimizations are based around visiting as few cells as possible
-- cells can have different sizes, 2x2, 3x3, (always square?) etc.
-- k-d replacement - recursively splits the search space in two, aiming for an equal number of points on each side, until each fraction of the search space has one point in it.
-- works similar to traditional databases with binary search trees
-- elasticsearch will stop at 1000 docs per side rather than just 1
-- it's a memory trade-off, the tree needs to be kept in memory at all times
-- Mike McCandless is a primary author
-- Better in almost all aspects for spatial
-- Elasticsearch 2.2 - geopoint searches
-- for 1d numeric fields the merge process is much faster, segments are presorted so only the final stage of a merge sort is performed.

BM25 by default in ES v5.0
--
- this is massively oversimplified, please read the code and the commit logs
- common words are naturally discriminated
- no more need to exclude stop words.
-- doc freq contribution in bm25 goes all the way to 0, rather than 1.0 in tf-idf
- term freq contribution
-- bm25 saturates earlier. bounded at 2.2 by default

Faster queries - two phase iteration
- query can be divided into
-- a fast approximation
-- a (slower) match
- phrase
-- approximation = conjunction
-- match = position check
- Geo polygon query
-- approximation = points in cells that are in or cross the polygon
-- match = check the point against the polygon

Match cost API
1. approximate - (description:search AND description:engine) AND (body:postings AND body:list)
** 1. iterate body:postings (1k)
** 2. Check description:engine (15k)
** 3. check description:search (200k)
** 4. check body:list (370k)
2. match - description:"search engine" body:"postings list"

Other changes
* better query-time synonyms
- disk based norms
- more memory efficient doc values
- more disk efficient sparse doc values
-- lucene 5.4
- BooleanQuery simplification in rewrite
-- for instance imagine you are searching a matching docs query
-- filter (constant score query)
-- this is going to perform a conjunction (all docs that match both queries)
-- lucene 5.5
- bulk scorer specialization for MatchAllDocsQuery and MUST_NOT clauses
-- matching docs iterator must return sorted by score
-- re-balancing prioritizations was causing slowdowns
-- instead of doing it by document, now doing it in batches of 2048
-- a NOT b
- improve file truncation detection
-- leading cause of index corruption
-- checksums added in 4.8 notice the error
-- advances in 5.x add a nice error message

Also, SVN -> git


Scaling log ingestion at ????

10-12 nodes, old DB servers. lots of memory, disk, fast cpus

- number of indexes doesn't matter, it's about the number of shards
- stores 13 months worth of data

Initial config
- two jvms per server. One "hot" ingest instance pointed at fast disks.
one "warm" instance pointed at slower spinning disks for querying historical data
- 384G ram, 60TB of disk, 96 hardware threads, etc.

(initially) The bad:
- inefficient use of hardware
- limited number of nodes
- larger amounts of data to replicate on node failure

Thoughts:
- run all independent instances? Run many instances on one node?
- hardware overpowered for data nodes

(reworked cluster config) The good:
- Run elasticsearch instances completely independently. Separate yml config, separate data directories, etc.
- One hot ingest instance, 4+ warm instances
-- uses ~50% of memory for JVM
-- no matter how big your box is, keep jvm heap < 50%
-- at 5 to 7 instances on 48 core machines (96 with HT), running 5 instances does not need cpu affinity. Running 15 it does (but 15 is a bad idea)
- increase nodes
-- more resources for queries
-- better distribution of the larger number of shards
-- more shards supported, thus more indexes, thus more data available online
- faster replication on node failure due to smaller data sets

- increased peak query load possible by 10x (probably due to more jvm heap? didn't say)
- # of shards per instance affects scaling, query patterns
- requires creative allocation rules to keep primary and replica off of the same physical machine
-- quite a few. Requires many additional tags as well.
- get much more out of the hardware with smaller (~30G) heaps and more JVM instances


The Bad: What did it cost us?
- more complicated cluster management
-- need to address each instance for all management
-- need to set up proper allocation rules to protect data from a physical node crash
--- originally intended to define racks.
Allows separating primary/replicas by rack
--- reuse this to support many independent instances per machine
- Non-standard configuration means it is more difficult to get assistance from public resources and documentation
-- (paid) Support is your friend here.
- physical node loss means greater cluster impact
-- When a physical node goes down 40T+ of data needs to re-replicate
-- this takes about an hour in their network/disk configuration
- higher cost of hardware for data nodes
- more power and environment controls necessary

The End: well not really, we are still growing
- capable of processing 5.75 billion events a day
- peak is 2x of trough (as high as 10x)
- 24-30 physical data nodes
- 5 instances (1 hot, 4 warm), lots of storage, fast cpus, lots of memory
- resilient
-- can handle hardware failure and/or node crashes
-- can handle maintenance/upgrades
- small footprint in DC


Questions?

- tested JVM heaps down to 24G, was not enough for ES caching. Optimal is just under 32G in his instances
- main difference between hot and warm instances is how expensive the disk is
- running es 1.4.4, testing 2.2 but have not finished converting everything
- nodes are not separated on the boxes (VM, docker, etc), one kernel

Relevant config variables:
node.name - to identify the specific instance.
For consistency suggest naming the process, configuration files and node all the same
node.server - to identify the physical server
node.tag - optional - to be able to move shards around
path.conf - to identify the unique configuration directory
path.data
path.logs
path.work
cluster.routing.allocation.same_shard.host: true - enable check to prevent allocation of multiple instances of the same shard on a single host


--------------------------------------------------------------------
Stories from Support: Top Problems and Solutions
--------------------------------------------------------------------
"It Depends"

The Big Issues
1. Fielddata (always single word)
- most common problem.
- 60% of total heap can be taken up in 1.x and 2.x; more likely to be used in 1.x
- 100% of data nodes will be impacted
- 2.x (Mostly) Solves This - Thanks to doc_values being on by default
-- Wouldn't help for the wikimedia use case, we have very few aggregations that would use doc values. We also have relatively small fielddata though

Doc Values!
- Columnar store of values
- Written at index time to disk
-- Adds an extra file next to the segment on disk
-- takes more indexing time
-- Means you don't have to un-invert the inverted index at query time
-- Part of the reason JVM should not take more than 50% of memory
- Leverages the operating system's filesystem cache
- When upgrading from 1.x to 2.x you have to reindex to get the change
-- You cannot enable doc values after the fact.
****
- GET /_cat/fielddata?v is your friend
-- Couple hundred megabytes per node. Basically nothing.
- Mostly for aggregates (as in, not useful at wikimedia)
****
- analyzed strings do not currently support doc_values, which means that you must avoid using such fields for sorting, aggregating and scripting
- analyzed strings are generally tokenized into multiple terms, which means that there is an array of values
- with few exceptions (e.g., significant terms), aggregating against analyzed strings is not doing what you want
- unless you want the individual tokens, scripting is largely not useful
- Big improvements coming in ES 2.3 ("keyword" field)
-- Redo not-analyzed strings to allow them to use analyzers that produce single tokens
- use multi fields to create analyzed and un-analyzed fields

2. Cluster state
- every cluster state change is sent to every node
-- requires a lot of short lived, potentially large network messages
-- gets worse with more nodes or indices
-- mappings tend to be the largest portion
- GET /_cluster/state?pretty
- Not stored in memory as JSON, so this is just to give the idea (it's likely 5% of it, ballpark)
-- elastic1001 reports JSON as 25MB. 5% would be 1.25MB (tiny)
-- a Gig would be devastating for any cluster size
-- 100M for a 100 node cluster would be "reasonable"
-- 100M for a 10 node cluster would be odd
-- Worst seen (by presenter) was 4G

ES 2.0 introduces cluster state diffs between nodes
-- changes become far more manageable and a large cluster state is no longer as problematic
-- reducing your mapping size helps too
-- Do not allow dynamic mappings in production
- Do not use _types to separate data
-- duplicating mapping if they are the same
-- if they are different unnecessarily something
-- prefer to create separate indices
- Create a "type" field to do this for you
- Prefer changes in bulk rather than one-by-one (allow changes to be batched)
** Create 900 new titlesuggest indices in one cluster operation?**
-- there is no recommended size.
Maybe 5MB payload size? Play around with it


3. Recovery
- Changed dramatically in ES since 1.4
- restarting a node or otherwise needing to replicate shards
- terribly slow process
- segment by segment
- minor risk of corruption pre-es 1.5 with a sketchy network
-- if 90% done and the network drops there could be issues
-- don't use 1.5

1.6 -> asynchronous allocation and synced flushing
1.7 -> delayed allocation and prioritized allocation
2.0 -> Cancelled allocation
2.1 -> prioritized allocation for replicas

If you are not on 1.7.5 get there SOON.


4. Node Sizing

- elasticsearch is a parallel processing machine
- java can be a slow garbage collecting calculator
- slow disks. The problem for every data store?
-- Do not use SAN. Really. Don't do it
- A few huge instances isn't best, keep JVMs <= 30G

And how long is a piece of string?

Memory
- 50% of system ram to heap
- Up to 30500M - no more or your heap loses optimizations!
-- compressed pointers

CPU
- indexing tends to be CPU bound
- At least 2 cores per instance
-- more you give it the faster it can go (duh)

IO
- disks get hammered for other reasons, including write-impacting
****
- translog in 2.0 fsyncs for every index operation
-- major major change in 2.0
****
- SSDs or Flash are always welcome

5. Sharding
- ES 1.x defaults to 5 primary, 1 replica
- ES 2.x defaults to 2 primary, 1 replica
**
- 50GB is MAX shard size. More for recovery than performance reasons
**
- Increase primaries for higher write throughput and to spread load
- Shard because you need write capacity.
- If you need better search response (latency) increase number of shards
- Replicas are not backups.
Rarely see a benefit with more than 1
-- They are for HA, not backups
-- They do not save you if "everything goes to hell"
-- Snapshots are your friends
--- Incremental, next snapshot gets only segments that are "new"
-- Oversharding is becoming an issue
--- 41k shards in a 10 node cluster is extreme

Bad Queries
- Deep pagination
-- ES 2.0 has a soft limit of 10k hits per request. Linearly more expensive per shard
-- Use scan and/or scroll API
- Leading wildcards
-- Equivalent to a full table scan (BAD!!)
- Scripting
-- Without parameters
-- dynamically (inline)

Aggregations
- Search is always faster than aggregations
- Don't aggregate when you want search

Merge throttling
- Disable it on SSDs

Use bulk processing
- Indexing 1 by 1 you will have a bad day (if trying to index many things)

-------------------------------
Securing Elasticsearch
-------------------------------

Pitfalls securing search - or why REST based security solutions rarely work

Why would you?
- Most of the clients are using the HTTP protocol
- Easy to implement using a standard HTTP reverse proxy
- Many elasticsearch operations are REST compliant
-- PUT /index/type/doc_id - indexes a document
-- GET /index/type/doc_id - gets a document
- A reverse proxy isn't enough, you also need firewalls
-- (i don't think that's a big deal, at all. Use SSL)
-- must establish a security perimeter

What does it get you?
- Works great for authentication. Access/No access to everything
- Fails for authorization
-- Role based access to certain indices, certain parts of indices, certain fields

Limiting access to an index by using URL matching
- Theoretically easy! Just set up the following URL matching patterns:
-- GET /myindex/*/*
-- PUT /myindex/*/*
-- POST /myindex/*/*

- But what about?
- POST /myindex/type/_bulk
- /myindex1,myindex2/*/*
- Could disable multi-index rest actions
-- rest.action.multi.allow_multi.??? = none

Search is not always limited to the index in the url
Cross index search operations:
- More like this query
- Terms lookup mechanism in the terms query
- percolation
- you end up needing to whitelist/blacklist query patterns

Be really careful with URL matching rules!
- Documented exceptions on the elasticsearch website

Document level security
- A typical solution for document level access control
-- Add an ACL token(s) to each document
-- Create an alias with ACL filters for each user
-- Limit users' access to their aliases
- Works *most* of the time
-- Aliases are not considered in
--- Suggest API
--- Children Aggs
If you are trying to use aliases as a security solution, augment it WRT query types

Filtered aliases are not a security mechanism
- More like working together with other systems

Field level security
- A naive solution - filtering of source in response
-- /_search?_source_exclude=secret_part.*
-- This is not enough. If you can search it you have access to it. There is no exception to this.
A field in the source can appear in multiple ways
-- multi match, aggregations, etc. etc.
-- Requires basically whitelisting queries

Securing search is a difficult problem
- URL path is not a filter
- Some operations work on the entire index
- It's a search and analytics engine, which makes frequency analysis attacks so much easier
-- Lots of nice statistics like term frequency, etc.

Is there a solution?
- Move all security to the business logic layer.
Deny raw access
- Move security to the transport layer
-- This is what shield (now x-pack) does
-- They believe that securing at the lowest level is the only way to secure elasticsearch
- Hard to know from the REST api what will happen with a given request
- Authorization is secured at the individual TransportAction level
-- Prevents the user from accessing a terms lookup (for example) on an index they don't have access to
- TLS encryption for inter-node transport and http
- Message signing means even without TLS a node cannot join the cluster
- Secured down to the lucene level - integrating with how elasticsearch reads the index
- In a request with document level security the query defined on a role will get added to the index reader. Documents that aren't allowed look deleted to the IndexReader
- Field level security is similar. At the lucene level fields look like they don't exist


------------------------------------------------------------------------------------------------
Boosting Search Accuracy with Engine Scoring and Predictive Analytics
------------------------------------------------------------------------------------------------
Paul Nelson
Chief Architect, Search Technologies

(note to self: This will probably be advertising to hire them as consultants)
(talked to paul after the talk, he had previously talked with nik and is interested in working with us. I think what he really wants is to use our click through data to improve his own business case though. Tomasz will be in contact and see if we can work together).

- Speaker has been involved in this since the mid 90s
-- NIST
- Many of these ideas came from TREC, updated with big data and logs

- Common customer complaint
-- Q: What's wrong?
-- A: Our search is bad
-- Q: How bad?
-- A: Bad
-- Q: Scale 1-10?
-- A: 8? 9?
Let's call it 9.23
-- (completely bogus, means nothing without statistically based numbers)

- Golden query set
-- Key documents
-- Just because it's called golden doesn't mean it's any good
- Top 100 / Top 1000 queries analysis
-- May not represent actual usage of your system
-- Taking a random selection can be more useful. For example, for one customer 40% of queries were human names. Large variance didn't show up in the top 1000 queries.
-- The long tail could be 85%+ of queries
- Zero result queries
- Abandonment rate
- Queries with clicks
- Conversion

These all have a problem: You need to put your search engine into production to compute them, which hurts your bottom line.
- (if you are ecommerce or advertising)

What are we trying to achieve?
* Reliable metrics for search accuracy
-- repeatable
* Can run analysis off-line
-- Does not require production deployment (!)
* Can accurately compare two engines
-- Different technologies
-- Different versions of the same engine
-- Different scoring on the same engine
* Runs quickly = agility = high quality
-- Iterating the engine quickly is key
* Can handle different user types / personalization
-- Broad coverage
-- Modify relevancy scoring based on source website, "cluster" the user is in
* Provides lots of data to analyze what's going on
-- Data to decide how best to improve the engine

Leverage logs for accuracy testing
Query logs, Click Logs -> Engine scoring framework -> Search engine under evaluation
* engine scores - have one (or two, or three) top level numbers that give your engine a score for the query set
* other metrics and histograms - generate useful metrics, such as a histogram of where recorded click-throughs were in the final positions
* scoring database - Record historical scores, and what changes led to those scores. Re-review prior changes after making new changes. They might not be as useful any more!
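The log-driven scoring framework described above can be sketched in a few lines. This is a minimal, hypothetical harness (all names invented here, and MRR stands in for whatever top-level formula the framework actually uses): replay logged clicks against the engine under test and produce both a single engine score and a histogram of click positions.

```python
from collections import Counter

def engine_score(click_log, search_fn, max_rank=10):
    """click_log: list of (query, clicked_doc_id) pairs from production logs.
    search_fn: query -> ranked list of doc ids (the engine under evaluation).
    Returns (mean reciprocal rank, histogram of where clicks landed)."""
    histogram = Counter()
    reciprocal_ranks = []
    for query, clicked in click_log:
        results = search_fn(query)[:max_rank]
        if clicked in results:
            pos = results.index(clicked) + 1   # 1-based rank
            histogram[pos] += 1
            reciprocal_ranks.append(1.0 / pos)
        else:
            histogram["miss"] += 1             # clicked doc no longer in top N
            reciprocal_ranks.append(0.0)
    score = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
    return score, histogram

# Toy engine: static rankings per query, standing in for a real search call.
fake_engine = {"apple": ["d1", "d2", "d3"], "pear": ["d9", "d4"]}
log = [("apple", "d2"), ("apple", "d1"), ("pear", "d7")]
score, hist = engine_score(log, lambda q: fake_engine.get(q, []))
```

Rerunning this after every relevance tweak, and recording the score in a "scoring database", gives exactly the repeatable offline comparison the talk argues for.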
From Queries -> Users
* User by User metrics
-- change in focus
-- Queries don't matter. People matter. It does *not* matter if a single query returned the right answer. It matters if the user got their right answer (maybe two queries? Not the end of the world)
* Group activity by session and/or user
-- call this an "activity set"
-- Merge sessions and users
* Use Big Data to analyze *ALL* users
-- There are no stupid queries and no stupid users
-- Overall performance based on the experience of the users

Engine Score
- Group activity by session and/or user (Queries and Clicks)
-- Trying to create a model for the user. Everything the user has found useful
- Determine "relevant" documents
-- Lots of people don't have user ids, only sessions
-- What did the user view? Add to cart? Purchase?
-- Did the search engine return what the user ultimately wanted?

* Determine engine score per user per query
-- Σ power(FACTOR, position) * isRelevant[user, searchResult[Q,position].DocID]
-- Evaluated from each user's point of view
-- (Note: Many other formulas possible, MRR, MAP, DCG, etc.)
-- Speaker happens to like this one. Some algos prefer top results, some prefer results "on page"
* Average score for all of a user's queries = user score
* Average scores across all users = final engine score
-- Important to average per user (or per session) and not per query
-- Prevents a single user with 99 queries from changing the score

Off-Line Engine Analysis
- searchResult[Q,position].DocID
-- Done by re-executing the users' queries offline

Continuous improvement cycle
Modify engine -> execute queries -> compute engine score -> evaluate results -> back to beginning
* Kind of like sports, you can have a personal best engine score
* Scoring never stops, generating one score is not enough. Generate scores and track over time

What else can we do with engine scoring?
- predictive analytics
- The brutal truth about search engine scores
-- They are not based on science.
-- Random ad-hoc formulas put together
- BM25 invented in 1980. Not state of the art. Predictive analytics is

Use Big Data to predict relevancy
Document Signals, Comparison Signals, Query Signals, User Signals

The score predicts probability of relevancy
Value from 0 -> 1
* Can be used for threshold processing
* All documents too weak? Try something else!
* Can combine results from different sources / constructions together
- Identifies what's important
-- Machine learning optimizes the parameters
-- identifies the impact and contribution of every parameter
- If a parameter does not improve relevancy -> REMOVE IT!!!

Come out of the darkness
- Ultimate message is use science


DEMO
- Not time based, just a list of queries and clicks
- Don't have to tie clicks to users

Q/A

- Single relevancy score not as useful - Build a relevancy lab to have multi-dimensional scores

- tf/idf may be a signal, but the resulting formula usually replaces tf/idf completely
-- tf/idf only used to surface likely documents

-------------------------
NY Times Search
-------------------------

explicit search, formal user queries, used but not the main use case
implicit search powers much of nytimes.com

Ingestion
---------
Few updates - A few thousand updates per day
Sources - CMS, Legacy systems, file based archives, OCR
Latency - <1s, Low latency is important. Especially for implicit search.

Content typically doesn't show up on the website until it's indexed

Typical Use Cases
-----------------
- not comprehensive
- People search for something they read. They read an article and they want to read it again
-- Only 1 correct result.
-- Goal (not accomplished): recognize these queries and give a single, correct, result
- Find something you wrote
-- Less common, but happens surprisingly often
- Find reviews for movies/books
-- Typically has a single correct result
- Find recipes
-- Similar, but more than a single correct result
- "Manual sharing"
-- Instead of sending a link to someone you tell them you read an article on xyz and they search for it
- "Casual" news search
-- Looking for information on recent events
- Serious research
-- More interesting
-- Makes up a big amount of queries on the website
-- Typically journalists or historians

Why not just use google?
1. Keep the customers on site
-- Google risks the customer going to a different site
2. There is no google for native apps
-- Native apps need to provide good search
-- Because they have multiple apps that do slightly different things they need special cases
3. We know our content better
-- Most important
-- Have editorial teams, taxonomy teams.
They know which article is relevant for a specific case
-- Elasticsearch has allowed them to tune this exactly how they want

Search should be targeted at their most advanced readers
- A movie review can be found on google
- If you want to know the 10 most important articles on syria, NY Times can do that much better
-- Not there yet, lots of work still to do, but that is the direction they are headed

Our current Setup
-----------------
- The search stack runs entirely on AWS
-- Case for the NYTimes website
- Deployed using an in-house infrastructure tool
- Nagios and New Relic for monitoring
- Sumo Logic for log management and analytics

System Setup
- Two full production clusters
-- Not just for elasticsearch, but the whole stack (ingestion pipeline, etc)
- DNS based failover
- Also used for maintenance/major changes/reindexing
- 16 node ES clusters in production (m1.xlarge)

CMS->SQS->Handler->Normalization->Merge->Indexer->Elasticsearch->Search API<->Semantic platform
- Many intermediate layers read/write data from S3
- Beanstalk handles queueing (i think?). Replacing with kafka


Normalization
- Large number of rules defining how to extract fields from different data sources
- Constructs a canonical document which is the same for all data sources
- Canonical document pushed into ES

Merge
- Ensures that only the latest version is indexed

Indexer
- Pushes documents into elasticsearch and mongodb

Search API
- powered by elasticsearch and mongodb
- For each search result looks up the full document from mongodb and returns that to the client

Two clusters make maintenance work easy.
- The way the handler works makes it relatively easy to handle the disparate formats

Number of issues with this setup:
- Too many moving parts
- Too many things that can go wrong
- Difficult with this setup to index new fields
-- New fields are added quite frequently
-- Have to go back to normalization, add the new fields, reindex everything
-- Many times requires going all the way back to the CMS
-- A simple task that shouldn't be this hard
- Despite the fact they have two clusters they don't know if the offline cluster is still working
-- Cache goes cold
-- Have to warm up the cache before putting it online
-- Offline cluster doesn't receive production traffic
-- Only discover failure when they try to fail over
- Hard to run a complex setup like this on a laptop
-- Want to run the entire pipeline locally
- Overall this is a setup that is so complicated that they spend a good amount of their time doing operations work instead of search
-- Seen this issue in many places

Future Work - Simplify
1. Full document in elasticsearch
- store and index full documents in elasticsearch
-- Not done previously for historical reasons of support for document sizes (alt search engine?)
- No external lookup necessary in the API, no MongoDB
- Demand driven API - clients decide what they want.
- Not in production yet, but some tests show this is better

2. Replay
- Use kafka
- All content will go through kafka
- Kafka will persist all content, source-of-truth for search
- Easy to create new search clusters through Kafka replay
- Goal: Never upgrade a cluster. Boot a new cluster, replay kafka to populate, switch over and kill the old cluster

3.
Keep all clusters busy
- All production traffic will be replayed from the active production cluster to standby production and to staging
- Uses gor
- Makes sure that the standby cache is always warm
- Changes to staging will be exposed to production traffic
- If the standby cluster crashes they will know

4. Virtualize
- Vagrant boxes for everything
- Make it easy to run the full pipeline locally.
- Provision vagrant boxes using the same system as production servers
-- *exactly* the same. No differences between local and prod per-box configuration

CMS->Kafka->logstash->elasticsearch->Search API
- No S3. No beanstalk. Much simpler.

Conclusion and comments
1. Figure out how you can be better than Google
-- If you don't think you can be better let google do it
-- Know your users
-- Search should be an integrated part of the product
--- Search should not be an afterthought
--- Plan it from the beginning
-- Be consistent across devices and platforms
-- Focus on where you can make a difference for the end user

2. Infrastructure is hard
Think about how to
- do deployments
-- Far too often people set up search clusters without planning how to deploy new versions
-- Know how you can easily upgrade
- do upgrades
- reindex content
- change the way you normalize data
- create new search clusters
... and make it easy.

Things to do
- Put your infrastructure in version control
-- Nothing should be done from the command line to set up infrastructure
-- When everything crashes, and some day it will, it needs to be easy
- Vagrant everything
- Have immutable servers, replace instead of changing
-- Only really possible on cloud; raw hardware requires a different approach
- Make deploying a new stack easy and automatic

Is elasticsearch for text search?
- Background for this question is they attended a one day elastic conference in new york
-- None of the talks were about text search
-- This week is very similar
-- Looked over new features in ES 2.0. Categorized into analytics/search/other stuff (mostly cluster)
-- Found 6 issues out of 119 dealing with search. 5%.
-- Skeptical that elastic wants to focus on text search as an important use case
-- Those features are not revolutionary in any way, just regular tweaks

New things necessary in elasticsearch for text search:
- Query log handling
-- Need to know what users are searching for and what they are doing
-- Completely missing from elasticsearch
-- Considers it a big problem
-- Elasticsearch has plenty of ways to analyze logs. But nothing that logs and ingests its own search history
- Easier relevancy tuning
-- Difficult to change things. Getting the right score for one document without changing all other searches
-- Needs tooling. Providing a set of benchmark queries
-- When you change things for one query, how does that affect other queries?
- Improved linguistics
-- No built in language detection
-- No morphologies - lemmatization
-- No compounding
-- No entity extraction
- Really wants to see some leadership from elastic in the text search space

Q/A:
Optimizations around analyzers?
- Not much. Most is done pre-indexing by adding keywords and custom scoring
Primary data store?
- Source of truth is a custom CMS
Moving away from AWS?
- not trying to, just want to make sure it's possible (using kafka instead of SQS, etc.)
Follow up on latency, time from publish to shown on site?
> - Less than 1s
>
>
> -------------------------------------------------------
> Geospatial data structures in elasticsearch and lucene
> -------------------------------------------------------
>
> Geo field types
> - geo point
> -- used to have 8 or 9 parameters, as of ES 2.2 only 3 parameters
> necessary
> -- others still there, but deprecated and being removed
> -- type, geohash_precision, geohash_prefix, ignore_malformed all that is
> left
> -- only type, ignore_malformed is being kept
> - geo shapes
> -- used to have 12 parameters, many expert only
> -- going forward: type, ignore_malformed, tree, precision,
> distance_error_pct, orientation, ???_only
> -- orientation going away; the standards define right-hand rule
> -- ES 5.0 ISO compliant, will reject non-compliant shapes
> -- points-only mode allows short-circuiting the quad tree logic,
> optimizes for point-only types
> --- points-only vs points index is 10-15% better performance
> --- only geo shape, no points only
> OGC simple feature access
> ISO Geographic information - spatial schema 19107:2003
>
> Geo indexing
> GeoPointField change in lucene 5.4
> - create a quad tree encoding: first bit is world-level latitude, next
> bit world-level longitude
> -- rotates back and forth every two bits
> -- Space filling curve, dimensionality reduction
> -- Their implementation is a Z-curve (Morton curve). Nearest neighbors
> are not necessarily so close.
> -- Investigating Hilbert curve. Current is 4 operations to encode,
> Hilbert takes longer. Tradeoff of accuracy/locality vs performance
> -- precision step parameter affects size of index. More precision ->
> more terms
> -- "optimal" precision step for smallest index possible. Currently 9
>
> 2.1 -> 2.3 large change in performance of geo. ~30% index size, 12%
> heap, 40% time (indexing or query?)
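(The interleaved lat/lon encoding in the notes above is easy to sketch. This is a toy illustration, not Lucene's actual GeoPointField code; the 16-bits-per-axis quantization is an arbitrary choice for the example.)

```python
def quantize(value, lo, hi, bits=16):
    """Map a coordinate into an integer grid cell of 2**bits cells."""
    cell = int((value - lo) / (hi - lo) * ((1 << bits) - 1))
    return max(0, min((1 << bits) - 1, cell))

def morton_encode(lat, lon, bits=16):
    """Z-order / Morton encoding: interleave lat and lon bits.

    Latitude goes in the odd (higher) bit positions so the leading bit
    of the code is the world-level latitude split, the next bit the
    world-level longitude split, alternating from there, as described
    in the talk.
    """
    y = quantize(lat, -90.0, 90.0, bits)
    x = quantize(lon, -180.0, 180.0, bits)
    code = 0
    for i in range(bits):
        code |= ((y >> i) & 1) << (2 * i + 1)  # latitude bit
        code |= ((x >> i) & 1) << (2 * i)      # longitude bit
    return code
```

Sorting points by this single integer gives the space-filling-curve dimensionality reduction mentioned above; the Z-curve's drawback (neighbors on the curve are not always geographic neighbors) is why they are investigating Hilbert curves.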
> 5.0 -> geo shapes and geo points merged as geo @experimental
> - balanced k-d trees
>
> Geo search
>
> Geo aggregations
>
> ---------
> BM25
> ---------
>
> A free text search is a very inaccurate description of our information
> need
> What you want
> - quick learner
> - works hard
> - reliable
> - enduring
> - ...
>
> But you type: "hard-working, self-motivated, masochist"
>
> The purpose of this talk
> - know the monster, understand what the parameters of BM25 do
> - know why it has the label "probabilistic"
> - mathematical concepts behind it: more for your entertainment
> - be convinced that switching to BM25 is the right thing to do
> -- hardest part of this talk
> - be able to impress people with your in-depth knowledge of
> probabilistic scoring
>
> The current default - TF/IDF
> - search in self-description of applications for these words
> - we want to order applications by their relevance to the query
>
> Evidence for relevance - term frequencies
> - Use term frequencies in description, title, etc.
>
> Major tweaks:
> - term freq: more is better
> -- the more often it occurs, the more relevant this document is
> -- square root of the term frequency, summed up
> -- increases non-linearly
> - inverse document frequency
> -- common words are less important
> -- a, is, the, etc.
> -- also non-linear: multiply tf with idf and sum that
> - long documents with same tf are less important: norm
> -- also non-linear; idf is more L-shaped than this though
>
> Bool query and the coord factor
> - doc a: holiday: 4, china: 5
> - doc b: holiday: 0, china: 15
> - Coord factor: reward document a with both terms matched
> -- basically a hack
>
> TF/IDF
> - successful since the beginning of lucene, since first release
> - well studied
> - easy to understand
> - one size fits most
>
> So... can we do better?
> - It is somewhat a guess
> - sum square root of term frequencies, which was a guess of an original
> developer
> -- ad hoc
>
> Probabilistic ranking and how it led to BM25
> The root of BM25: probability ranking principle (abridged)
> - "if retrieved documents are ordered by decreasing probability of
> relevance on the data available, then the system's effectiveness is the
> best that can be obtained for the data."
> - Chance to deploy prior knowledge of relevance and scoring to make
> results better
> - if mapped to a mathematical framework, can take advantage of
> mathematicians' past proofs
>
> Getting into math:
> 1. Don't care how much heap, how many CPUs. Get to that later, forget
> about it now
> 2. Think of yourself as super smart, beyond 2 standard deviations
>
> Estimate relevancy:
> - simplification: relevance is binary!
> - get a dataset of queries and relevant / irrelevant documents
> -- use that to estimate relevancy
> - how to use this to actually get a better scoring?
> - P(R=1|d,q)
> -- P(A|B) = probability of A given B, R = relevancy (1/0), d = document
> - For each document, query pair - what is the probability that the
> document is relevant
> -- Need some sort of matrix, document vs query
> ...here be math...
> - paper: The Probabilistic Relevance Framework: BM25 and Beyond
> - Stephen Robertson and Hugo Zaragoza
> ..and we get to..
> - W(d) = sum log( P(..)P(..) / P(..)P(..) )
>
> BM25: how to estimate all these probabilities
> - two assumptions
> -- the binary independence model - a dramatic but useful simplification
> -- a query term occurs in a document or doesn't; we don't care how often
>
> - Robertson/Sparck Jones weight
> -- another equation
> -- assumes unlimited supply of interns
> - simplified
> -- still use the Robertson/Sparck Jones weight but assume that the
> number of relevant documents is negligible (R=0, r=0)
>
> - BM25 has an S shape, goes to zero.
> tf/idf is L-shaped
>
> - In TF/IDF the more often the term occurs the better
> - but, is a document about a term just because it occurs a certain
> number of times?
> - this property is called eliteness
> -- in the literature
>
> Example for "eliteness"
> - "tourism"
> - Look at wikipedia: Many documents are about tourism
> - Many documents contain the word tourism - but are about something
> completely different, like for example just a country
>
> Can we use prior knowledge on the distribution to ...
>
> Eliteness as Poisson distribution
> Two cases:
> - document is not about the term
> - document is about the term
>
> How to estimate this?
> - gather data on eliteness for a term
> - many term frequencies -> do for many documents
>
> How relevance ties into that
> - suppose we knew the relationship of ???
>
> ....
>
> Tf saturation curve
> - limits influence of tf
> - allows tuning the influence by tweaking k
> - major difference from tf/idf to bm25
>
> So... we assume all documents have the same length?
> - Poisson distribution assumes a fixed length of documents
> - but they don't have that (most of the time)
> - we have to incorporate this too!
> - scale tf by it like so: (long equation)
> -- includes interpolation between 1 and document length over average
> document length
>
> Influence of b: length weighting
> - tweak influence of document length
> - similar to tf/idf, depending on b value
>
> Is BM25 probabilistic?
> - many approximations
> - really hard to get the probabilities right even with unlimited data
> -- mathematicians have been trying to solve this for years, no success
> - BM25 is "inspired" by probabilistic ranking
>
>
> -- 1992 TREC-2 took a "leap of faith" (?)
> -- 1993 TREC-3 BM25 final!
> -- 1999 First lucene release (tf/idf)
> -- 2011 Pluggable similarities + BM25 in lucene
>
> So... will I get better scoring with BM25?
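(The tf saturation curve in the notes above is the clearest practical difference, and it is easy to see numerically. A sketch, not Lucene's implementation: classic Lucene's sqrt(tf) contribution versus the BM25 tf term, using the commonly cited defaults k1=1.2, b=0.75.)

```python
import math

def classic_tf(tf):
    """Classic Lucene tf contribution: grows without bound."""
    return math.sqrt(tf)

def bm25_tf(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    """BM25 tf term with length normalization.

    b interpolates between no length penalty (b=0) and full scaling by
    document length over average document length (b=1).
    """
    norm = k1 * (1 - b + b * dl / avgdl)
    return tf * (k1 + 1) / (tf + norm)

for tf in (1, 5, 25, 125):
    print(tf, round(classic_tf(tf), 2), round(bm25_tf(tf), 2))
```

With dl equal to avgdl, the BM25 contribution saturates just under k1 + 1 = 2.2 no matter how often the term occurs, while sqrt(tf) keeps growing: the S shape versus the L shape described above, and why stop words stop mattering.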
> Pros with the frequency cutoff:
> - TF/IDF: common words can still influence the score
> - BM25: limits influence of term frequency
> -- less influence of common terms
> -- no more coord factor!
> -- check if you should disable coord for bool queries?
> --- index.similarity.default.type: BM25
>
> Other benefits:
> - parameters can be tweaked. Not sure if many people actually need to.
> You need data to measure if scoring is getting better or not. Do not
> tweak without data
> To update:
> - close index
> - update mapping (or settings)
> - re-open index
>
> Mathematical framework to include non-textual features
> - go read the paper
>
> A warning: lower automatic boost for short fields
> - With TF/IDF: short fields (title, ...) are automatically scored higher
> - BM25: scales field length with the average
> -- field length treatment does not automatically boost short fields (you
> have to explicitly boost)
>
> Is BM25 better?
> - literature suggests so
> - challenges suggest so (TREC, ...)
> - Users say so
> - Lucene developers say so
> - Konrad Beiske says so: "BM25 vs Lucene default similarity"
> - But: it depends on the features of your corpus.
> - Finally: you can try it out now! Lucene stores everything necessary
> already
>
> Useful literature:
> - Manning et al., Introduction to Information Retrieval
> - Robertson and Zaragoza, The Probabilistic Relevance Framework: BM25
> and Beyond
> - Robertson et al., Okapi at TREC-3
>
> ----------------------------
> Quantitative Cluster Sizing
> ----------------------------
>
> 1. Understanding why "it depends"
> Factors
> - Size of shards
> - Number of shards on each node
> - Size of each document
> - Mapping configuration
> -- which fields are searchable
> -- automatic multi-fields
> -- whether messages and _all are enabled
> - Backing server capacity (SSD vs HDD, CPU, etc.)
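(A back-of-envelope sketch of how the factors above combine. Every number here is a hypothetical placeholder; the talk's whole point is that bytes-per-doc and the maximum acceptable shard size must come out of the experiments it describes, not a formula.)

```python
def estimate(doc_count, bytes_per_doc_on_disk, replicas, max_shard_gb):
    """Estimate total disk and a primary shard count from measured inputs.

    bytes_per_doc_on_disk comes from the disk-utilization experiment;
    max_shard_gb from the shard-breaking-point experiment.
    """
    primary_bytes = doc_count * bytes_per_doc_on_disk
    total_bytes = primary_bytes * (1 + replicas)          # replicas copy everything
    shard_cap = max_shard_gb * 1024**3
    shards = max(1, -(-primary_bytes // shard_cap))       # ceiling division
    return total_bytes, shards

# e.g. 500M docs at a measured ~600 bytes each on disk, 1 replica,
# shards capped at 30GB (all illustrative numbers)
total, shards = estimate(500_000_000, 600, 1, 30)
```

This only answers "how much disk" and "how many shards"; nodes-per-shard and sustained indexing rate still have to come from the saturation and small-cluster benchmarks below.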
> Your organization's requirements / SLAs
> - retention period of data
> - ratio and quantity of indexing vs search
> - nature of use case
> - continuous vs bulk indexing
> - kinds of queries being executed
> - desired response time for queries that are run frequently vs
> occasionally
> - required sustained vs peak indexing rate
> - budget and failure tolerance
>
> Let's try to determine
> - How much disk storage will N documents require
> - When is a single shard too big for my requirements
> -- When will that shard reach a point where search queries will no
> longer be within SLA
> - How many active shards saturate my particular hardware
> -- Keep putting shards on a node? Back it down
> - How many shards/nodes will I need to sustain X index rate and Y
> search response
>
> 2. Sizing methodology
> - run 4 experiments
> - 1. Determine various disk utilization
> -- Use a single node cluster with one index
> -- 1 primary, 0 replicas
> -- Index a decent amount of data (1GB or about 10 million docs)
> -- Calculate storage on disk
> --- Mostly for logging use case, before and after _forcemerge
> -- Repeat the above calculations with different mapping configurations
> --- _all both enabled and disabled
> --- settings for each field
> - 2. Determine breaking point of a shard
> -- Use a single node cluster with one index: 1 primary, 0 replicas
> -- Index realistic data and use realistic queries
> -- Plot index speed and query response time
> --- Build some dashboards. Latency increases with shard size
> -- Determine where the point of diminishing returns is for your
> requirements
> - 3. Determine saturation point of a node
> -- Use a single node cluster with one index
> --- 2 primaries, 0 replicas
> --- repeat experiment two to see how performance varies
> --- keep adding more shards to see when the point of diminishing
> returns occurs
> - 4. Test desired configuration on a small cluster
> -- Configure a small representative cluster
> -- Add representative data volume
> -- Run realistic benchmarks:
> --- Max indexing rate
> --- Querying across varying data volumes
> --- Benchmark concurrent querying and indexing at various levels
> -- Measure resource usage, overall docs, disk usage, etc.
>
> 3. Scenario and experiment results
> - Practical sizing example
> -- show some patterns, draw some conclusions about how it works
> -- no numbers; all your numbers will be different anyway
> - Graph latency, record count, shard size
> -- Latency increases proportionally with record count
> - ... lots of stuff about the log ingestion use case ...
>
> 4. Interpreting results and expanding to other scenarios
>
> ...
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
> -- @wikidata
